
CRITICAL ASSESSMENT OF TECHNIQUES FOR PROTEIN STRUCTURE PREDICTION

12

Twelfth meeting
Gaeta, Italy
DECEMBER 10-13, 2016

TABLE OF CONTENTS
AIR ....................................................................................................................................................................... 10
AIR: AN ARTIFICIAL INTELLIGENCE-BASED PROTOCOL FOR PROTEIN STRUCTURE REFINEMENT USING MULTI-OBJECTIVE PARTICLE SWARM OPTIMIZATION ............................................................ 10
AKBAR ................................................................................................................................................................. 12
AKBAR: A GBM META-CLASSIFIER COMBINING CONTACT DETECTING METHODS AND SEQUENCE DERIVED FEATURES ........................12
ATOME2_CBS ....................................................................................................................................................... 14
@TOME-2: A PIPELINE FOR COMPARATIVE MODELING OF PROTEIN-LIGAND COMPLEXES. ..........................................................14
BAKER (REFINEMENT) .......................................................................................................................................... 16
ADDRESSING LOW-RESOLUTION REFINEMENT CHALLENGES USING ROSETTA AND MD IN CASP12 ................................................16
BAKER-ROSETTASERVER ....................................................................................................................................... 18
IMPROVING ROBETTA USING META-GENOME SEQUENCES, ITERATIVE REFINEMENT, AND PROQ2 ..................................................18
BATES_BMM ........................................................................................................................................................ 21
PROTEIN MODEL CONSTRUCTION, OPTIMIZATION AND DOCKING USING PARTICLE SWARM OPTIMIZATION........................................21
BATES_BMM (REFINEMENT) ................................................................................................................................ 23
PHYSICS-ONLY BASED REFINEMENT OF PREDICTED PROTEIN FOLDS IN CONTACT-MAP SPACE .........................................................23
BHAGEERATHH-PLUS ............................................................................................................................................ 25
BHAGEERATHH+: A HYBRID METHODOLOGY BASED SOFTWARE SUITE FOR PROTEIN TERTIARY STRUCTURE PREDICTION ......................25
CAMD-BIR_CUBA ................................................................................................................................................. 27
AB INITIO STRUCTURE PREDICTION LEVERAGING THE ENTROPIC CONTRIBUTION OF THE HYDROPHOBIC EFFECT IN AN EMPIRICAL FREE-ENERGY FUNCTION ............................................................ 27
CHUO-U ............................................................................................................................................................... 29
3-DIMENSIONAL MODELS CREATED BY CHUO-U TEAM USING THE FAMS PROGRAM, STATISTICAL POTENTIALS AND SOME INSPECTION..29
CHUO-U-SERVER .................................................................................................................................................. 30
3-DIMENSIONAL MODELS CREATED BY CHUO-FAMS-SERVER TEAM USING THE FAMS PROGRAM AND STATISTICAL POTENTIALS ...........30
CPCLAB ................................................................................................................................................................ 31
TOPMODEL: A MULTI-TEMPLATE META-APPROACH TO HOMOLOGY MODELLING .....................................................................31
DEEPFOLD-CONTACT, DEEPFOLD-BOOM .............................................................................................................. 33
PROTEIN CONTACT PREDICTION WITH DEEP FULLY CONVOLUTIONAL NEURAL NETWORK............................................................33
DELCLAB .............................................................................................................................................................. 34
PROTEIN 3D STRUCTURE PREDICTION BY COMBINING SPECTRAL AND SEQUENCE HOMOLOGY OF AMINO ACID SEQUENCES .................34
DISTILL ................................................................................................................................................................. 36
DISTILL FOR CASP12 ...................................................................................................................................................36
EDAROSE.............................................................................................................................................................. 38
ESTIMATION OF DISTRIBUTION AND DISTANCE RESTRAINTS FOR FRAGMENT-BASED PROTEIN STRUCTURE PREDICTION .....................38
ELOFSSON ............................................................................................................................................................ 40

MANUAL MODEL RANKING BASED ON AGREEMENT WITH CONTACT MAPS ................................................................................40
FALCON_COLORS ................................................................................................................................................. 42
IMPROVING RESIDUE-RESIDUE CONTACT PREDICTION VIA LOW-RANK AND SPARSE DECOMPOSITION OF RESIDUE CORRELATION MATRIX .42
FALCON_TOPO ..................................................................................................................................................... 45
IMPROVING PROTEIN THREADING ACCURACY VIA COMBINING LOCAL AND GLOBAL POTENTIAL USING TREECRF MODEL .................45
FARAGGI .............................................................................................................................................................. 47
FDUBIO ................................................................................................................................................................ 48
SORTING PROTEIN DECOY MODELS BY LEARNING-TO-RANK ....................................................................................................48
FEIG ..................................................................................................................................................................... 50
(LOC)PREFMD: PROTEIN STRUCTURE REFINEMENT VIA MOLECULAR DYNAMICS SIMULATIONS ..................................................50
FLOUDAS.............................................................................................................................................................. 52
CONSTRAINED GREY BOX GLOBAL OPTIMIZATION FOR PROTEIN STRUCTURE PREDICTION ..............................................................52
FLOUDAS_REFINESERVER ..................................................................................................................................... 54
PRINCETON_TIGRESS 2.0: PROTEIN GEOMETRY REFINEMENT USING SIMULATIONS AND SUPPORT VECTOR MACHINES .................54
FLOUDAS_SERVER ................................................................................................................................................ 56
FAST AND ACCURATE TEMPLATE-BASED PROTEIN STRUCTURE PREDICTION ..............................................................................56
FONT .................................................................................................................................................................... 57
IMPROVED PROFILE CONSTRUCTION METHODS AND THEIR APPLICATIONS TO 3D STRUCTURE PREDICTION OF PROTEINS .................57
GAPF_LNCC .......................................................................................................................................................... 59
META PROTEIN STRUCTURE PREDICTION WITH A MULTIPLE MINIMA GENETIC ALGORITHM ..........................................................59
GAPF_LNCC_SERVER ............................................................................................................................................ 61
GAPF_LNCC_SERVER: A FULLY AUTOMATED SERVER FOR TEMPLATE-FREE PROTEIN STRUCTURE PREDICTION WITH A MULTIPLE MINIMA GENETIC ALGORITHM ............................................................ 61
GOAL, LEE ............................................................................................................................................................ 63
PROTEIN STRUCTURE MODELING BY GLOBAL OPTIMIZATION...................................................................................................63
GRUDININ ............................................................................................................................................................ 65
USING A NOVEL KNOWLEDGE-BASED DISTANCE-DEPENDENT POTENTIAL DERIVED USING CONVEX OPTIMIZATION FOR FOLD RECOGNITION OF CASP12 TARGETS ............................................................ 65
GRUDININ (SAXS-ASSISTED) ................................................................................................................................. 67
SAXS-GUIDED PROTEIN STRUCTURE RELAXATION ALONG THE LOWEST NORMAL MODES WITH APPLICATION TO CASP12 SAXS-ASSISTED TARGETS ............................................................ 67
GRUDININ-DEEPL ................................................................................................................................................. 69
USING DEEP LEARNING FOR FOLD RECOGNITION OF CASP12 TARGETS ....................................................................................69
INNOUNRES ......................................................................................................................................................... 71
USE OF MODIFIED UNRES FORCE FIELD AND REPLICA-EXCHANGE MOLECULAR DYNAMICS IN PHYSICS-BASED TEMPLATE-FREE PREDICTION OF PROTEIN STRUCTURES ............................................................ 71
HHGG ................................................................................................................................................................... 73

HH GUIDE TO THE GALAXY: COLLABORATIVE PROTEIN STRUCTURE PREDICTION BY HHPRED AND GALAXYREFINE .............................73
IFOLD_1, IFOLD_2, DEEPFOLD-CONTACT, DEEPFOLD-BOOM, NAIVE ..................................................................... 75
PROTEIN CONTACT PREDICTION WITH DEEP FULLY CONVOLUTIONAL NEURAL NETWORK............................................................75
INTFOLD4 ............................................................................................................................................................. 77
FULLY AUTOMATED PREDICTION OF PROTEIN TERTIARY STRUCTURES WITH LOCAL MODEL QUALITY SCORES USING THE INTFOLD4 SERVER ............................................................ 77
JONES-UCL ........................................................................................................................................................... 80
PROTEIN STRUCTURE PREDICTION USING FRAGFOLD AND METAPSICOV ..............................................................................80
KIAS-GDANSK....................................................................................................................................................... 81
PROTEIN STRUCTURE PREDICTION WITH THE UNRES FORCE FIELD AIDED BY KNOWLEDGE-BASED INFORMATION ..............................81
KIHARALAB .......................................................................................................................................................... 84
STRUCTURE PREDICTION, REFINEMENT, AND QUALITY ASSESSMENT WITH PRESCO, MD, AND GMQ .........................................84
KF-CONSENSUS .................................................................................................................................................... 86
KLOCZKOWSKI...................................................................................................................................................... 87
KSCONS................................................................................................................................................................ 88
INFERRING LONG RANGE CONTACTS USING THE KNOB-SOCKET MODEL ..................................................................................88
LAUFER, LAUFER_SEED ......................................................................................................................................... 90
USING PHYSICS AND BAYESIAN INFERENCE FOR STRUCTURE PREDICTION .................................................................................90
LEE ....................................................................................................................................................................... 93
PROTEIN STRUCTURE MODELING BY GLOBAL OPTIMIZATION...................................................................................................93
LNCCUNB ............................................................................................................................................................. 94
TEMPLATE-FREE PROTEIN STRUCTURE PREDICTION USING ATOMIC BURIALS PREDICTION WITH A MULTIPLE MINIMA GENETIC ALGORITHM ............................................................ 94
MCGUFFIN ........................................................................................................................................................... 96
MANUAL PREDICTION OF PROTEIN TERTIARY AND QUATERNARY STRUCTURES, 3D MODEL REFINEMENT AND INTERFACE ACCURACY ASSESSMENT ............................................................ 96
MELDINGKSCONS................................................................................................................................................. 99
INFERRING LONG RANGE CONTACTS USING THE KNOB-SOCKET MODEL ..................................................................................99
MESHI_CON_SERVER ......................................................................................................................................... 101
N-MESHI-SCORE – A DECOY QA METHOD ....................................................................................................................101
MESHI_SERVER .................................................................................................................................................. 103
MESHI-SCORE – AN INDIVIDUAL DECOY QA METHOD.......................................................................................................103
METAPSICOV ...................................................................................................................................................... 105
CONSENSUS COEVOLUTION-BASED PROTEIN CONTACT PREDICTION USING METAPSICOV 2.0 ..................................................105
MODFOLD6, MODFOLD6_COR, MODFOLD6_RANK ............................................................................................ 107
AUTOMATED 3D MODEL QUALITY ASSESSMENT USING THE MODFOLD6 SERVER ..................................................................107

MODFOLDCLUST2 .............................................................................................................................................. 110
AUTOMATED 3D MODEL QUALITY ASSESSMENT USING MODFOLDCLUST2 ..........................................................................110
MUFOLD ............................................................................................................................................................ 112
PROTEIN STRUCTURE PREDICTION AND REFINEMENT USING MUFOLD BY SAMPLING CASP DECOYS .............................................112
MUFOLDQA_C ................................................................................................................................................... 114
A 2-STAGE CONSENSUS QUALITY ASSESSMENT METHOD FOR PROTEIN STRUCTURE PREDICTION ...............................................114
MUFOLDQA_S .................................................................................................................................................... 116
A TEMPLATE-BASED QUALITY ASSESSMENT METHOD FOR PROTEIN STRUCTURE PREDICTION ...................................................116
MULTICOM ........................................................................................................................................................ 118
TERTIARY STRUCTURE PREDICTION BY THE MULTICOM HUMAN GROUP .............................................................................118
MULTICOM (TS).................................................................................................................................................. 120
SAXS-ASSISTED TERTIARY STRUCTURE PREDICTION BY MULTICOM ...................................................................................120
MULTICOM-CLUSTER (TS) ................................................................................................................................... 122
PROTEIN TERTIARY STRUCTURE PREDICTION BY MULTICOM-CLUSTER SERVER ...................................................................122
MULTICOM-CLUSTER (QA).................................................................................................................................. 124
PROTEIN MODEL QUALITY ASSESSMENT BY MULTICOM-CLUSTER SERVER ........................................................................124
MULTICOM-CONSTRUCT (TS) ............................................................................................................................. 126
PROTEIN TERTIARY STRUCTURE PREDICTION BY MULTICOM-CONSTRUCT SERVER .............................................................126
MULTICOM-NOVEL (TS) ...................................................................................................................................... 128
IMPROVED INTEGRATION OF TEMPLATE-BASED AND TEMPLATE-FREE MODEL SAMPLING METHODS FOR PROTEIN STRUCTURE PREDICTION ............................................................ 128
MULTICOM-NOVEL, MULTICOM-CONSTRUCT, MULTICOM-CLUSTER (RR) ........................................................... 130
MACHINE LEARNING, COEVOLUTION-BASED AND HYBRID METHODS FOR CONTACT PREDICTION ...............................................130
MULTICOM-REFINE ............................................................................................................................................ 132
DE NOVO PROTEIN MODELING VIA STEPWISE, PROBABILISTIC SYNTHESIS AND ASSEMBLY OF FOLDON UNITS ...................................132
MYPROTEIN-ME, SKWARK ................................................................................................................................. 134
CONVOLUTIONAL NEURAL NETWORKS FOR PROTEIN CONTACT PREDICTION FROM COEVOLUTION ALONE. ......................................134
NAÏVE ................................................................................................................................................................ 136
PROTEIN CONTACT PREDICTION WITH DEEP FULLY CONVOLUTIONAL NEURAL NETWORK..........................................................136
PCOMB-DOMAIN ............................................................................................................................................... 137
DOMAIN-LEVEL MODEL QUALITY ASSESSMENT TO IMPROVE OVERALL MODEL QUALITY..............................................................137
PCONS, PCONS-NET ........................................................................................................................................... 139
IMPROVED MODEL QUALITY ASSESSMENTS USING ROSETTA ENERGY TERMS AND DEEP LEARNING ................................................139
PCONSC2, PCONSC3, PCONSC31 ........................................................................................................................ 140
IMPROVED CONTACT PREDICTION FOR SMALL FAMILIES USING PCONSC3 COMBINING DCA AND NON-DCA METHODS ...................140
PCONS-NET ........................................................................................................................................................ 143

PCONSFOLD2: OPTIMISING THE INFORMATION FROM CONTACT PREDICTION IN AB INITIO PROTEIN FOLDING .................................143
PHYRETOPOALPHA ............................................................................................................................................ 145
PROTEIN FOLD RECOGNITION USING CONTACT THREADING IN THE PROGRAM PHYREPOWER ......................................................145
PROQ2 ............................................................................................................................................................... 147
MODEL QUALITY ASSESSMENT AND SELECTION USING PROQ2 ...........................................................................................147
PROQ3, PROQ3_1, PROQ3_1DISO, PROQ3_2DISO, RSA_SS_CONS, PCONS, PCONS-NET .................................... 148
IMPROVED MODEL QUALITY ASSESSMENTS USING ROSETTA ENERGY TERMS AND DEEP LEARNING ................................................148
PROTSAV-PLUS ................................................................................................................................................... 151
PROTSAV+: A METASERVER FOR PROTEIN TERTIARY STRUCTURE QUALITY ASSESSMENT ............................................................151
QASPROCL ......................................................................................................................................................... 153
QUALITY ASSESSMENT OF PROTEIN STRUCTURE BASED ON CLUSTERING ................................................................................153
QASPROGP ........................................................................................................................................................ 154
QUALITY ASSESSMENT OF PROTEIN STRUCTURE BASED ON GEOMETRICAL PARAMETERS ...........................................................154
QMEANDISCO .................................................................................................................................................... 156
QMEANDISCO – COMBINING STATISTICAL POTENTIALS WITH CONSENSUS-BASED PREDICTION OF LOCAL MODEL QUALITY ...........156
QUARK ............................................................................................................................................................... 158
NNB-QUARK: AB INITIO PROTEIN STRUCTURE ASSEMBLY GUIDED BY SEQUENCE-BASED CONTACT PREDICTIONS ........................158
RAPTORX-CONTACT ........................................................................................................................................... 160
ACCURATE DE NOVO PREDICTION OF PROTEIN CONTACT MAP BY ULTRA-DEEP LEARNING .......................................................160
RBO_ALEPH ....................................................................................................................................................... 162
PROTEIN STRUCTURE PREDICTION BY RBO ALEPH IN CASP12 ...........................................................................................162
RBO-EPSILON ..................................................................................................................................................... 165
CONTACT PREDICTION BY RBO-EPSILON IN CASP12 ........................................................................................................165
ROSETTA_AT_KINGSTON .................................................................................................................................... 167
OPTIMIZING ROSETTA ACCORDING TO A TARGET’S STRUCTURAL CLASS ..................................................................................167
RRCPRED ............................................................................................................................................................ 168
RESIDUE-RESIDUE CONTACT PREDICTION ........................................................................................................................168
SEOK-ASSEMBLY, SEOK (ASSEMBLY) ................................................................................................................... 169
PREDICTION OF PROTEIN COMPLEX STRUCTURES BY GALAXY IN CASP12 ............................................................................169
SEOK-SERVER, SEOK (REFINEMENT) ................................................................................................................... 171
AUTOMATIC PROTEIN STRUCTURE REFINEMENT WITH AN IMPROVED ENERGY FUNCTION AND DIVERSE SAMPLING OF UNRELIABLE REGIONS ............................................................ 171
SEOK-SERVER (TS, QA), SEOK-REFINE (TS) .......................................................................................................... 173
GALAXY IN CASP12: FULLY AUTOMATED PROTEIN STRUCTURE PREDICTION, MODEL QUALITY ASSESSMENT, AND REFINEMENT .......173
SHEN-GROUP ..................................................................................................................................................... 175
R2C 2.0: AB INITIO RESIDUE CONTACT MAP PREDICTION USING DYNAMIC FUSION STRATEGY AND GAUSSIAN NOISE FILTER ..............175

SHORTLE ............................................................................................................................................................ 177
ENERGY-BASED REFINEMENT USING A GENETIC ALGORITHM WITH MULTIPLE SELECTION TERMS AND SURVIVAL FUNCTIONS ..........177
SKWARK ............................................................................................................................................................. 179
CONVOLUTIONAL NEURAL NETWORKS FOR PROTEIN CONTACT PREDICTION FROM COEVOLUTION ALONE. ......................................179
SPIDERS ............................................................................................................................................................. 180
EVOLUTIONARY APPROACH TO TEMPLATE FREE AB-INITIO PROTEIN STRUCTURE PREDICTION GUIDED BY STATISTICAL ENERGY FUNCTION ............................................................ 180
SSTHREAD .......................................................................................................................................................... 183
TEMPLATE-FREE PROTEIN STRUCTURE PREDICTION USING SSTHREAD ....................................................................................183
SUMMA_PMF .................................................................................................................................................... 185
KB_0.1_V2.0: REFINEMENT USING AN MM/PMF HYBRID WITH A MODIFIED DISULFIDE ENERGY PROFILE...............................185
SUN_TSINGHUA ................................................................................................................................................. 186
ALL-ATOM CONDITIONED SELF-AVOIDING WALK: AN AB INITIO PROTEIN FOLDING METHOD ...................................................186
TASSER ............................................................................................................................................................... 189
TASSER IN CASP12 ..................................................................................................................................................189
TSLAB-ASSEMBLY ............................................................................................................................................... 190
PREDICTION OF OLIGOMERIC PROTEIN STRUCTURES BASED ON TEMPLATE-BASED MODELING...................................................190
TSSPRED2........................................................................................................................................................... 191
TSPPRED: A COMPOSITE APPROACH FOR TERTIARY STRUCTURE PREDICTION OF PROTEINS .......................................................191
UNRES (TS) ......................................................................................................................................................... 192
USE OF UNRES FORCE FIELD AND REPLICA-EXCHANGE MOLECULAR DYNAMICS IN PHYSICS-BASED TEMPLATE-FREE PREDICTION OF PROTEIN STRUCTURES ............................................................ 192
UNRES (REFINEMENT) ........................................................................................................................................ 195
ALL ATOM MOLECULAR DYNAMICS AS A REFINEMENT TOOL...............................................................................................195
VOROMQA, VOROMQASR, VOROMQA-SELECT .................................................................................................. 197
MODEL QUALITY ASSESSMENT AND SELECTION USING VOROMQA .....................................................................................197
WALLNER ........................................................................................................................................................... 199
COMBINING PROQ2 AND PCONS TO IMPROVE MODEL QUALITY ASSESSMENT, SELECTION AND REFINEMENT .................................199
WANG1, WANG2, WANG3, WANG4 (QA) ........................................................................................................... 201
RESIDUE-SPECIFIC QUALITY ASSESSMENT OF INDIVIDUAL PROTEIN MODELS USING ENSEMBLES OF DEEP NETWORKS, SUPPORT VECTOR MACHINES, AND RANDOM FORESTS ............................................................ 201
WANG1, WANG2, WANG3, WANG4 (RR) ............................................................................................................ 203
PROTEIN RESIDUE-RESIDUE CONTACT PREDICTION USING DEEP LEARNING AND DIRECT-COUPLING ANALYSIS ..............................203
WFALL-CHENG ................................................................................................................................................... 205
TERTIARY STRUCTURE PREDICTION BY WFALL-CHENG........................................................................................................205
WF-BAKER-UNRES .............................................................................................................................................. 207

PREDICTION OF PROTEIN STRUCTURE WITH THE UNRES FORCE FIELD AIDED BY CONTACT- AND SECONDARY-STRUCTURE PREDICTION
DERIVED FROM EVOLUTIONARILY RELATED PROTEINS ..........................................................................................................207

WFCPUNK .......................................................................................................................................................... 209


PREDICTION OF PROTEIN STRUCTURE WITH THE UNRES FORCE FIELD AIDED BY SECONDARY-STRUCTURE AND CONTACT PREDICTION .209
WFDB_BW_SVGROUP ........................................................................................................................................ 211
ASSESSING PROTEIN STRUCTURE MODELS USING PROTEIN STRUCTURE NETWORKS [PSN-QA] .................................................211
WFMESHI-SEOK ................................................................................................................................................. 214
WEFOLD – THE WFMESHI-SEOK BRANCH ......................................................................................................................214
WFMESHI-TIGRESS ............................................................................................................................................. 216
WEFOLD – THE WFMESHI-TIGRESS BRANCH ................................................................................................................216
WFROSETTA-MUFOLD ........................................................................................................................................ 218
PROTEIN STRUCTURE REFINEMENTS BY USING MUFOLD TO SAMPLE ROSETTA DECOYS ..............................................................218
WFROSETTA-PROQ-MESHI ................................................................................................................................. 220
WEFOLD – THE WFROSETTA-PROQ-MESHI PIPELINE .......................................................................................................220
WFROSETTA-PROQ-MODF6 ............................................................................................................................... 223
SELECTION OF ROSETTA DECOYS USING PROQ2 AND THE MODFOLD6 SERVER ......................................................................223
WFROSETTA-WALLNER ....................................................................................................................................... 226
WEFOLD – THE WFROSETTA-WALLNER BRANCH ..............................................................................................................226
WFROSETTA-PQ-MESHI-MSC ............................................................................................................................. 229
SELECTION OF ROSETTA DECOYS USING PROQ2 AND MESHI-MSC .....................................................................................229
WFRSTTA-PQ2-SEDER......................................................................................................................................... 232
YANG-SERVER .................................................................................................................................................... 233
RESIDUE-RESIDUE CONTACT PREDICTION BY INTEGRATING TEMPLATE-BASED MODELING AND AB-INITIO CONTACT PREDICTION METHODS
..............................................................................................................................................................................233
YASARA .............................................................................................................................................................. 235
THE YASARA HOMOLOGY MODELING MODULE V4.0 WITH NEW PROFILE AND THREADING METHODS FOR REMOTE TEMPLATE
RECOGNITION ............................................................................................................................................................235

ZHANG ............................................................................................................................................................... 237


PROTEIN STRUCTURE PREDICTIONS BY ZHANG HUMAN GROUP IN CASP12 ............................................................................237
ZHANG_CONTACT .............................................................................................................................................. 239
NN-BAYES: COMBINING NEURAL NETWORK AND NAÏVE BAYES CLASSIFIER FOR SEQUENCE-BASED PROTEIN CONTACT PREDICTION 239
ZHANG-REFINEMENT ......................................................................................................................................... 241
ATOMIC-LEVEL PROTEIN STRUCTURE REFINEMENT BY HOMOLOGY TEMPLATE GUIDED MOLECULAR DYNAMICS SIMULATIONS .........241
ZHANG-SERVER .................................................................................................................................................. 243
PROTEIN STRUCTURE PREDICTIONS BY I-TASSER IN CASP12 .............................................................................................243
CAPRI: BATES_BMM ........................................................................................................................................... 245

DESCENDING THE PROTEIN BINDING FUNNEL – DOCKING, SCORING AND REFINEMENT OF PROTEIN COMPLEXES IN CASP-CAPRI 2016
..............................................................................................................................................................................245
CAPRI: CLUSPRO ................................................................................................................................................ 248
PERFORMANCE OF CLUSPRO SERVER IN 2016 CASP/CAPRI ROUNDS .................................................................................248
CAPRI: FERNANDEZ-RECIO ................................................................................................................................. 249
COMBINATION OF TEMPLATE-BASED, AB INITIO DOCKING AND SCORING WITH PYDOCK FOR THE MODELING OF PROTEIN ASSEMBLIES IN
THE CASP12-CAPRI CHALLENGE .................................................................................................................................249

CAPRI: HADDOCK ............................................................................................................................................... 252


HADDOCK’S PERFORMANCE IN THE SECOND JOINT CASP-CAPRI ROUND ...........................................................................252
CAPRI: GRUDININ .............................................................................................................................................. 255
USING MACHINE-LEARNING, SYMMETRY INFORMATION AND EXHAUSTIVE SPACE EXPLORATION FOR TEMPLATE-FREE MODELING IN CAPRI
ROUND 37 ...............................................................................................................................................................255
CAPRI: INTERPRED ............................................................................................................................................. 256
INTERPRED: A PIPELINE TO IDENTIFY AND MODEL PROTEIN-PROTEIN INTERACTIONS ..............................................................256
CAPRI: OLIVA ..................................................................................................................................................... 258
PERFORMANCE OF A PURE CONSENSUS APPROACH FOR THE SCORING OF DOCKING DECOYS IN CASP12-CAPRI37 .....................258
CAPRI: LENSINK.................................................................................................................................................. 260
MODELING PROTEIN-PROTEIN ASSEMBLIES: THE CASP12-CAPRI CHALLENGE .......................................................................260
CAPRI: TAKEDA-SHITAKA_LAB ............................................................................................................................ 262
PREDICTION OF OLIGOMERIC PROTEIN STRUCTURES BASED ON TEMPLATE-BASED MODELING IN CAPRI ROUND 37 .........262

CAPRI: VAKSER ................................................................................................................................................... 263


MODELING CAPRI TARGETS 110 – 120 BY TEMPLATE-BASED AND FREE DOCKING ................................................................263
CAPRI: VAKSER ................................................................................................................................................... 265
SEMI-ANALYTICAL CONTACT POTENTIAL FOR PROTEINS AND PROTEIN COMPLEXES ....................................................................265
CAPRI: VENCLOVAS ............................................................................................................................................ 267
COMPARATIVE MODELING OF PROTEIN COMPLEXES IN CAPRI ROUND 37 ............................................................................267
CAPRI: ZOU_GROUP ........................................................................................................................................... 269
THE QUALITY OF MONOMERIC STRUCTURE IS IMPORTANT TO THE DOCKING-BASED PROTEIN COMPLEX STRUCTURE PREDICTION ....269
CASP RELATED: CAMEO ...................................................................................................................................... 272
CAMEO - CONTINUOUS AUTOMATED MODEL EVALUATION ..............................................................................................272
CASP RELATED: QUATERNARY STRUCTURE PREDICTION .................................................................................... 274
MODELING OF PROTEIN QUATERNARY STRUCTURE OF HOMO- AND HETERO-OLIGOMERS BEYOND BINARY INTERACTIONS .................274
CASP RELATED: SAXS RESULTS FOR THE DATA-ASSISTED CATEGORY ................................................................... 275
SMALL ANGLE X-RAY SCATTERING PROVIDES VALUABLE STRUCTURAL INFORMATION FOR PROTEIN STRUCTURE PREDICTION .............275

AIR

AIR: an artificial intelligence-based protocol for protein structure refinement using multi-
objective particle swarm optimization

Ling Geng and Hong-Bin Shen


Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, and Key Laboratory of
System Control and Information Processing, Ministry of Education of China, Shanghai 200240, China
hbshen@sjtu.edu.cn

AIR is an Artificial Intelligence-based protein Refinement method built on a multi-objective
particle swarm optimization (PSO) protocol. Its basic motivation is to reduce the bias that arises
from minimizing only a single energy function, given the diversity of protein structures. The
fundamental idea is therefore to use multiple energy functions as objectives, so that the
potential inaccuracy of any single function can be corrected. We designed a multi-objective
PSO-based structure refinement protocol in which each protein structure is treated as a particle.
As refinement proceeds, the particles move through conformational space. The quality of the
current particles (structures) is evaluated with two energy functions, and we identify the
non-dominated particles, meaning those for which the value of at least one objective energy is
lower than that of all other particles. These non-dominated particles are placed in a set called
the Pareto set. After the iterations converge, the particles in the Pareto set are screened and a
subset is output as the final refined structures.

Methods
The protocol can be divided into three steps: collection of the initial particles (structures),
the main cycle of the multi-objective PSO simulation, and final structure selection.
In the first step, we collect several different initial models generated by the predictors from the
same amino acid sequence, and these initial models are viewed as initial particles, which are the
input to the optimization procedure.
In the second step, the main iterations are performed. Two energy functions serve as the two
fitness functions in the AIR protocol: the Rosetta energy function [1] and the QUARK energy
function [2]. Each iteration has two parts: 1) update the positions of the particles [3], which
means updating the conformations of the structures; and 2) evaluate the particles with the two
fitness functions by calculating the value of both energy functions for each particle. The
non-dominated particles are then added to the Pareto set [4].
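The position update in part 1 can be illustrated with the canonical PSO rule. This is a minimal sketch, not the AIR implementation: the abstract does not state the update rule (it cites [3]), so the function name `pso_step`, the inertia weight `w`, the acceleration coefficients `c1` and `c2`, and the representation of a structure as a flat list of numbers (for example torsion angles) are all assumptions. In multi-objective PSO variants the "leader" is usually drawn from the current Pareto set; that choice is also assumed here.

```python
import random
from typing import List, Tuple


def pso_step(
    x: List[float],
    v: List[float],
    personal_best: List[float],
    leader: List[float],
    rng: random.Random,
    w: float = 0.7,    # inertia weight (assumed value)
    c1: float = 1.5,   # pull toward the particle's own best (assumed value)
    c2: float = 1.5,   # pull toward a leader from the Pareto set (assumed value)
) -> Tuple[List[float], List[float]]:
    """One canonical PSO update for a single particle.

    v_new = w*v + c1*r1*(personal_best - x) + c2*r2*(leader - x)
    x_new = x + v_new
    """
    new_v = []
    for xi, vi, pi, li in zip(x, v, personal_best, leader):
        r1, r2 = rng.random(), rng.random()  # fresh random factors per coordinate
        new_v.append(w * vi + c1 * r1 * (pi - xi) + c2 * r2 * (li - xi))
    new_x = [xi + vi for xi, vi in zip(x, new_v)]
    return new_x, new_v


# Hypothetical two-coordinate particle starting at rest at the origin.
rng = random.Random(0)
position, velocity = pso_step([0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [1.0, -1.0], rng)
print(position, velocity)
```

A particle that already sits on both its personal best and the leader, with zero velocity, does not move, which is a quick sanity check on the rule.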
In the third step, we rank all the Pareto-optimal particles in the Pareto set according to a
particle-particle angle rule, and a final solution set from the Pareto set is output.
The existing 1-D optimization of a single energy function, lacking a constraint, may take the
structure too far away. The multi-objective optimization strategy designed in the AIR protocol
provides a different, constrained view of the structure by extending the 1-D optimization to a
2-D optimization space driven by the AI PSO engine.
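The selection of non-dominated particles over two energies can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: it uses the standard Pareto-dominance test (a particle is dominated if another particle is no worse on both energies and strictly better on at least one), the energy values are made up, and the names `dominates` and `pareto_set` are hypothetical.

```python
from typing import List, Tuple

# (first energy, second energy) for one structure; lower is better for both.
Energies = Tuple[float, float]


def dominates(a: Energies, b: Energies) -> bool:
    """True if a is no worse than b on every objective and strictly better on one."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_set(energies: List[Energies]) -> List[int]:
    """Indices of the particles that no other particle dominates."""
    return [
        i
        for i, e in enumerate(energies)
        if not any(dominates(other, e) for j, other in enumerate(energies) if j != i)
    ]


# Made-up energies for four particles.
scores = [(-120.0, -35.0), (-110.0, -40.0), (-100.0, -30.0), (-125.0, -20.0)]
print(pareto_set(scores))  # → [0, 1, 3]
```

Particle 2 is dropped because particle 0 is better on both energies; the other three each win on at least one energy, so none of them can be discarded without an additional ranking rule such as the angle rule mentioned above.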

1. Rohl C A, Strauss C E M, Misura K M S, et al. Protein Structure Prediction Using Rosetta.
Methods in Enzymology. 383:66-93 (2004).
2. Xu D, Zhang Y. Ab initio protein structure assembly using continuous structure fragments
and optimized knowledge-based force field. Proteins Structure Function & Bioinformatics.
80:1715-35 (2012).
3. Cheung NJ and Shen HB, Hierarchical particle swarm optimizer for minimizing the non-
convex potential energy of molecular structure. Journal of Molecular Graphics and
Modelling. 54: 114-122 (2014).
4. Coello C A C, Pulido G T, Lechuga M S. Handling multiple objectives with particle swarm
optimization. IEEE Transactions on Evolutionary Computation. 8:256-279 (2004).

AkbAR

AkbAR: A GBM meta-classifier combining contact detecting methods and sequence derived features

R. Rawi1 and R. Mall2


1 - Vaccine Research Center, National Institute of Allergy and Infectious Diseases, National Institutes of Health, 2 -
Qatar Computing Research Institute, Hamad Bin Khalifa University
reda.rawi@nih.gov

Although various contact detecting methods such as PSICOV1, DCA2 and plmDCA3 provide
reasonable predictions, their results do not overlap. New approaches that apply machine
learning to combine contact detecting techniques with sequence derived features significantly
outperform the individual methods. Among many such methods, including DNCON4, PconsC5 and
CoinDCA6, MetaPSICOV7 appears to be the state of the art owing to its better prediction
capability (accuracy, precision and recall). In CASP12, we applied a new meta-classifier based
on gradient boosting. By incorporating several contact detecting methods as well as sequence
derived features, we show that AkbAR outperforms MetaPSICOV on the PSICOV test set of 150
target proteins.

Methods
Our proposed meta-classifier AkbAR combines several distinct approaches for inferring
contacts from multiple sequence alignments with a broad range of sequence-derived
features that capture both the local and the global quality of the input protein sequence.
AkbAR generates two alignments of sequences homologous to the input protein sequence, using
two homology detection techniques: first, hhblits8 with the uniprot20 database, and second,
jackhmmer from the hmmer suite9 with the uniref90 database. Both homology detection methods
were run with default parameters. Next, we calculated contact predictions for both alignments
using mutual information, Freecontact10 and CCMpred11. We further added features derived from
alignment columns, such as entropy and position-specific scoring matrices. Finally, we included
sequence-position-specific metrics such as hydrophobicity, amino acid type and secondary
structure information derived by DeepCNF12. We used a gradient boosting machine (gbm),
available in the h2o package in R13, as the core predictive model and built separate learning
models for short, medium and long range contacts. The primary reasons for choosing gbm were its
ability to report feature importance and its effective scalability (www.h2o.ai).
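Two of the alignment-derived features named above, column entropy and mutual information between column pairs, can be computed as in the following sketch. It is only illustrative: the abstract does not say how gaps are treated or whether sequences are reweighted or corrections applied, so this version counts a gap as an ordinary symbol and uses plain frequencies. The toy alignment and the function names are hypothetical.

```python
import math
from collections import Counter
from typing import Hashable, List, Sequence


def column(msa: Sequence[str], i: int) -> List[str]:
    """Residues found at alignment position i, one per sequence."""
    return [seq[i] for seq in msa]


def entropy(symbols: Sequence[Hashable]) -> float:
    """Shannon entropy in bits of the observed symbol frequencies."""
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in Counter(symbols).values())


def mutual_information(col_i: Sequence[str], col_j: Sequence[str]) -> float:
    """MI(i, j) = H(i) + H(j) - H(i, j), from plain frequency counts."""
    joint = list(zip(col_i, col_j))
    return entropy(col_i) + entropy(col_j) - entropy(joint)


# Toy alignment: positions 0 and 1 co-vary perfectly, position 2 is conserved.
msa = ["AKL", "AKL", "GEL", "GEL"]
print(entropy(column(msa, 0)))
print(mutual_information(column(msa, 0), column(msa, 1)))
print(mutual_information(column(msa, 0), column(msa, 2)))
```

In this toy case the co-varying pair carries one bit of mutual information and the pair involving the conserved column carries none, which is the signal a contact predictor tries to exploit.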

Results
We evaluated the performance of AkbAR on the original PSICOV benchmark test set using the
standard assessment metrics applied in CASP, distinguishing between short, medium and long
range contacts. AkbAR achieves mean accuracies of 0.88, 0.85 and 0.87 for the top L/10 and
0.84, 0.76 and 0.5 for the top L/5 predicted long, medium and short range contacts, respectively.
Notably, AkbAR outperforms MetaPSICOV on all contact ranges and all reduced list sizes.
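The reduced-list evaluation can be sketched as follows: predictions are split by sequence separation, sorted by score, and the fraction of true contacts among the top L/k is reported. This is an illustration under assumptions, not the evaluation script used here. It takes the conventional CASP separation ranges (short 6 to 11, medium 12 to 23, long 24 or more residues), treats the reported "accuracy" as the fraction of correct predictions in the reduced list, and uses made-up predictions; all names are hypothetical.

```python
from typing import List, Optional, Set, Tuple

Prediction = Tuple[int, int, float]  # (residue i, residue j, predicted score)


def contact_range(i: int, j: int) -> Optional[str]:
    """Classify a residue pair by sequence separation (assumed CASP convention)."""
    sep = abs(i - j)
    if sep >= 24:
        return "long"
    if sep >= 12:
        return "medium"
    if sep >= 6:
        return "short"
    return None  # pairs closer than 6 residues are not evaluated


def top_fraction_correct(
    predictions: List[Prediction],
    true_contacts: Set[Tuple[int, int]],
    length: int,
    divisor: int,
    range_name: str,
) -> float:
    """Fraction of true contacts among the top L/divisor predictions of one range."""
    in_range = [p for p in predictions if contact_range(p[0], p[1]) == range_name]
    ranked = sorted(in_range, key=lambda p: p[2], reverse=True)
    top = ranked[: max(1, length // divisor)]
    if not top:
        return 0.0
    hits = sum(1 for i, j, _ in top if (min(i, j), max(i, j)) in true_contacts)
    return hits / len(top)


# Made-up long-range predictions for a 50-residue protein.
preds = [(1, 30, 0.9), (2, 40, 0.8), (3, 45, 0.7), (5, 35, 0.6), (10, 48, 0.5), (4, 29, 0.4)]
truth = {(1, 30), (2, 40), (10, 48)}
print(top_fraction_correct(preds, truth, length=50, divisor=10, range_name="long"))  # → 0.6
```

With L = 50 the top L/10 list holds five predictions, three of which are true contacts in this made-up example.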

Availability
The proposed method is currently being turned into an R package, which will be made available
independently soon.

1. Jones, D. T., Buchan, D. W. A., Cozzetto, D., & Pontil, M. (2012). Bioinformatics, 28 (2),
184–190.
2. Morcos, F., Pagnani, A., Lunt, B., Bertolino, A., Marks, D. S., Sander, C., Zecchina, R.,
Onuchic, J. N., Hwa, T., & Weigt, M. (2011). Proceedings of the National Academy of Sciences,
108 (49), E1293–E1301.
3. Ekeberg, M., Lövkvist, C., Lan, Y., Weigt, M., & Aurell, E. (2013). Physical review. E,
Statistical, nonlinear, and soft matter physics, 87 (1), 012707.
4. Eickholt, J. & Cheng, J. (2012). Bioinformatics, 28 (23), 3066–3072.
5. Skwark, M. J., Abdel-Rehim, A., & Elofsson, A. (2013). Bioinformatics (Oxford, England),
29 (14), 1815–6.
6. Ma, J., Wang, S., Wang, Z., & Xu, J. (2015). Bioinformatics (Oxford, England), 31 (21),
3506–13.
7. Jones, D. T., Singh, T., Kosciolek, T., & Tetchner, S. (2015). Bioinformatics (Oxford,
England), 31 (7), 999–1006.
8. Remmert, M., Biegert, A., Hauser, A., & Söding, J. (2012). Nature methods, 9 (2), 173–5.
9. Eddy, S. R. (2008). PLoS computational biology, 4 (5), e1000069.
10. Kaján, L., Hopf, T. A., Kalaš, M., Marks, D. S., & Rost, B. (2014). BMC bioinformatics,
15, 85.
11. Seemayer, S., Gruber, M., & Söding, J. (2014). Bioinformatics (Oxford, England), 30 (21),
3128–30.
12. Wang, S., Peng, J., Ma, J., & Xu, J. (2016). Scientific reports, 6, 18962.

Atome2_CBS

@TOME-2: a pipeline for comparative modeling of protein-ligand complexes.

Jean-Luc Pons, Jérôme Gracy, Gilles Labesse


Centre de Biochimie Structurale / CBS - CNRS UMR 5048 - UM - INSERM U 105, 29 rue de Navacelles 34090
MONTPELLIER France
labesse@cbs.cnrs.fr

@TOME 2.2 1 is a web pipeline dedicated to protein structure modeling and small ligand
docking based on comparative analyses. @TOME 2.2 allows fold recognition, template
selection, structural alignment editing, structure comparisons, 3D-model building and evaluation.
These tasks are routinely used in sequence analyses for structure prediction. In our pipeline the
necessary bioinformatic tools were efficiently interconnected in an original manner to accelerate
all the processes. We have also integrated comparative docking of small ligands, which is
performed using protein–protein superposition. The input is a simple protein sequence in
one-letter code with no comment. The resulting 3D models, protein–ligand complexes and
structural alignments can be visualized through dedicated Web interfaces or can be downloaded
for further studies.

Methods
The sequences submitted to CASP12 were processed automatically as follows.
The best structural alignments (SA) are extracted from the results of each fold recognition
program: Psiblast 2, Hhsearch 3, Fugue 4 and Sp3 5. For each SA, a 3D common core is generated
by TITO 6. From the overall results, the 20 best SA are selected according to a global score
(@TOME-2 Score) based on a set of quality descriptors: the fold recognition tool scores, the
sequence identity between query and template, the quality of the alignment (T-coffee 7), the
compatibility between the amino acid sequence and the 3D template (TITO), and the Verify3D 8
and QMean 9 evaluation scores of the model after optimization of side chains with the Scwrl 3
software 10. Structural clusters are calculated (Maxcluster 11) and all SA outside the main
cluster are rejected.
In a second step, many multi-template models were computed with MODELLER 9 12. For
each model to be built, 1, 2, 3 or 4 templates were selected according to the best scores
from @TOME-2, TITO and Qmean. For each group of templates, the MODELLER model is
calculated with and without additional restraints calculated with Pat 13 on the homologous
crystallographic structure. Among all the models obtained, the five best (by a consensus score
of Qmean, Dfire, Dope and the Modeller score) were submitted to CASP12.
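One simple way to form a consensus over several model quality scores is to rank the models under each score and sum the ranks. The sketch below shows that idea only; the abstract does not say how the consensus score is computed, so the rank-sum scheme, the score orientations (QMEAN treated as higher-is-better; the DFIRE and DOPE energies and the MODELLER objective as lower-is-better), the example values and all names are assumptions.

```python
from typing import Dict, List

# Assumed orientation of each score: True means a higher value is better.
HIGHER_IS_BETTER = {"qmean": True, "dfire": False, "dope": False, "modeller": False}


def consensus_rank(models: Dict[str, Dict[str, float]], top_n: int = 5) -> List[str]:
    """Return the top_n model names by summed per-score rank (lower sum is better)."""
    names = list(models)
    rank_sum = {name: 0 for name in names}
    for score, higher in HIGHER_IS_BETTER.items():
        ordered = sorted(names, key=lambda n: models[n][score], reverse=higher)
        for rank, name in enumerate(ordered):
            rank_sum[name] += rank
    return sorted(names, key=lambda n: rank_sum[n])[:top_n]


# Made-up scores for three hypothetical models.
candidates = {
    "m1": {"qmean": 0.70, "dfire": -300.0, "dope": -9000.0, "modeller": 800.0},
    "m2": {"qmean": 0.60, "dfire": -280.0, "dope": -8800.0, "modeller": 900.0},
    "m3": {"qmean": 0.50, "dfire": -310.0, "dope": -8700.0, "modeller": 1000.0},
}
print(consensus_rank(candidates, top_n=2))  # → ['m1', 'm2']
```

Rank sums avoid having to put scores with different units on a common scale, at the cost of ignoring how large the differences between models are.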

Availability: http://atome.cbs.cnrs.fr/

1. Pons,JL. & Labesse,G. (2009). @TOME-2: a new pipeline for comparative modeling of
protein-ligand complexes. Nucleic Acids Research, Web Server Issue 2009 - doi:
10.1093/nar/gkp368.
2. Altschul et al (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database
search programs, Nucleic Acids Res. 25(17), 3389-3402
3. Soding,J. (2005). Protein homology detection by HMM-HMM comparison. Bioinformatics,
21(7), 951-60.
4. Shi et al (2001). FUGUE: sequence-structure homology recognition using
environment-specific substitution tables and structure-dependent gap penalties. J. Mol. Biol.,
310, 243-257.
5. Zhou,H. & Zhou,Y. (2005). Fold recognition by combining sequence profiles derived from
evolution and from depth-dependent structural alignment of fragments, PROTEINS:
Structure, Function, and Bioinformatics 58, 321–328
6. Labesse,G. and Mornon,J-P. (1998). Incremental threading optimization (TITO) to help
alignment and modelling of remote homologues. Bioinformatics, 14, 206-350
7. Notredame,C. Higgins,DG. Heringa,J. (2000). T-Coffee: A novel method for fast and
accurate multiple sequence alignment. J Mol Biol ,302(1), 205-17.
8. Eisenberg,D. Lüthy,R. Bowie,JU (1997). VERIFY3D: assessment of protein models with
three-dimensional profiles. Methods Enzymol. 277, 396-404.
9. Benkert,P. Tosatto,S.C.E. and Schomburg,D. (2008). "QMEAN: A comprehensive scoring
function for model quality assessment." Proteins: Structure, Function, and Bioinformatics,
71(1), 261-277.
10. Canutescu,A. Shelenkov,A. and Dunbrack,R. L. (2003). A graph theory algorithm for
protein side-chain prediction. Protein Science 12, 2001-2014.
11. Ortiz,A.R., Strauss,C.E. and Olmea,O. (2002). MAMMOTH (matching molecular models
obtained from theory): an automated method for model comparison. Protein Science, 11,
2606-21.
12. Eswar,N. Eramian,D. Webb,B. Shen,M. Sali,A. (2006). Protein Structure Modeling With
MODELLER. Methods in Molecular Biology, Volume 426, 1, 145-159.
13. Gracy,J and Chiche,L (2005) PAT; a protein analysis toolkit for integrated biocomputing on
the web, Nucleic Acids Research Web Server issue, 33, W65-71.

BAKER (Refinement)

Addressing low-resolution refinement challenges using Rosetta and MD in CASP12

Hahnbeom Park1 , Sergey Ovchinnikov1, David E Kim1,2, and David Baker1,2


1 – Department of Biochemistry and Institute for Protein Design, University of Washington, 2 – Howard Hughes
Medical Institute, University of Washington
dabaker@uw.edu

One of the most prominent challenges in protein structure refinement is low-resolution
refinement: improving models that have roughly the correct fold but still deviate severely
from the native structure. Low-resolution refinement is crucial when structure prediction is
available only through comparative modeling with distant homologs or de novo approaches. In
CASP12, we tested our new low-resolution refinement approach, which applies effective
conformational search strategies using Rosetta and molecular dynamics (MD) simulations.

Methods
The procedure consists of i) large-scale conformational search using Rosetta1 and ii)
subsequent MD simulations using AMBER2 on selected candidates. We first use Rosetta to
search around the initial model for possible multiple local minima by fragment insertion and
structural relaxation, followed by iterative hybridization which shuffles secondary structure
segments in Cartesian space and inserts fragments from a generic fragment library in torsion
space. This iterative approach, guided by a recently developed Rosetta energy function and a
model selection scheme that maintains structural diversity at each iteration, successfully samples
towards essential directions required for refinement. The procedure is very weakly coupled to
restraints from the initial model, thus final output structures after multiple iterations have low
Rosetta energies and may differ largely from the initial structure. As in CASP11, structural drift
from the starting model during the procedure is allowed until it reaches the difference between
the starting model and the native structure, given as a GDT-HA value provided by the CASP
organizers (i.e. larger drift is allowed for harder targets).
Five models representing different structural clusters with the lowest Rosetta energies are
selected for subsequent MD simulations. For each model, five independent replicas of 10 ns
simulations are performed using the AMBER12SB force field in a TIP3P explicit water box,
followed by structural averaging to obtain the representative conformation3. This MD simulation
stage helps in both sampling and model selection. First, it serves as a sampling technique that we
believe is more effective in the high-resolution regime and is complementary to Rosetta. Second,
MD simulation trajectories offer a reasonable structural ensemble around the Rosetta model for
ensemble-based model selection by using average Rosetta energy values and structural variation
along the MD trajectory ensemble.
For high-resolution targets (GDT-HA > 55), we applied a less aggressive approach
combining Rosetta loop modeling with subsequent MD simulations.
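Two bookkeeping steps in this procedure, choosing low-energy representatives of distinct clusters and averaging coordinates over trajectory frames, can be sketched as follows. This is a simplified illustration, not the pipeline itself: it assumes cluster labels and energies are already available, that all frames have been superimposed onto a common reference beforehand, and that each frame lists the same atoms in the same order. The function names and the toy data are hypothetical.

```python
from typing import Dict, List, Tuple

Coords = List[Tuple[float, float, float]]  # one (x, y, z) per atom


def pick_cluster_representatives(
    models: List[Tuple[str, int, float]], n: int = 5
) -> List[str]:
    """Keep the lowest-energy model of each cluster, then the n lowest of those.

    Each entry is (model name, cluster label, energy); lower energy is better.
    """
    best: Dict[int, Tuple[str, float]] = {}
    for name, cluster, energy in models:
        if cluster not in best or energy < best[cluster][1]:
            best[cluster] = (name, energy)
    ranked = sorted(best.values(), key=lambda item: item[1])
    return [name for name, _ in ranked[:n]]


def average_structure(frames: List[Coords]) -> Coords:
    """Per-atom mean position over frames that are already superimposed."""
    n_frames = len(frames)
    n_atoms = len(frames[0])
    return [
        tuple(sum(frame[a][k] for frame in frames) / n_frames for k in range(3))
        for a in range(n_atoms)
    ]


# Toy data: three models in two clusters, and two single-atom frames.
decoys = [("a", 0, -100.0), ("b", 0, -120.0), ("c", 1, -90.0)]
print(pick_cluster_representatives(decoys, n=2))
print(average_structure([[(0.0, 0.0, 0.0)], [(2.0, 4.0, 6.0)]]))
```

Averaged coordinates generally have distorted local geometry, so in practice an averaged structure needs a further regularization step before it is usable as a model.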

Results
On a benchmark set of 33 low-resolution refinement targets (starting GDT-HA < 55) from
CASP9 to CASP11, model1 structures improved over the initial models by more than 5 GDT-HA
units for 19 targets, and significantly (> 20 GDT-HA units) for 6 targets. The most successful
cases were found when starting models were not too difficult (GDT-HA > 40) and small in size
(< 120 residues). However, we expect improvements to be less dramatic in CASP12 since many of
the targets are less “refinable”, e.g. they form homo- or hetero-oligomers, are too difficult
(GDT-HA < 30) or too large, or have initial models from servers that use similar refinement
methods (BAKER-ROSETTA) or other heavy refinement strategies (GOAL).
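For readers unfamiliar with the metric, GDT-HA averages the percentage of residues that lie within 0.5, 1, 2 and 4 Å of their native positions. The sketch below computes it for a single, fixed superposition of matched C-alpha coordinates. That is a simplification and an assumption: the official score searches over many superpositions to maximize each percentage, so this version gives a lower bound. The function name and coordinates are made up.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]

GDT_HA_CUTOFFS = (0.5, 1.0, 2.0, 4.0)  # distance cutoffs in angstroms


def gdt_ha_fixed_superposition(model: Sequence[Point], native: Sequence[Point]) -> float:
    """GDT-HA (0 to 100) for coordinates that are already superimposed."""
    distances: List[float] = [math.dist(m, n) for m, n in zip(model, native)]
    fractions = [
        sum(d <= cutoff for d in distances) / len(distances) for cutoff in GDT_HA_CUTOFFS
    ]
    return 100.0 * sum(fractions) / len(GDT_HA_CUTOFFS)


# Four residues displaced by 0.3, 0.8, 1.5 and 5.0 angstroms from the native positions.
native_ca = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
model_ca = [(0.0, 0.3, 0.0), (1.0, 0.8, 0.0), (2.0, 1.5, 0.0), (3.0, 5.0, 0.0)]
print(gdt_ha_fixed_superposition(model_ca, native_ca))  # → 56.25
```

In the example, one, two, three and three of the four residues fall within the respective cutoffs, giving (25 + 50 + 75 + 75) / 4 = 56.25.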

Availability
The refinement method used in CASP12 is a composite of Rosetta software applications and
AMBER software. Rosetta software is freely available for academic users, and all of the
applications newly developed for CASP12 will be available by the end of this year.

1. Song,Y. et al (2013). High-resolution comparative modeling with RosettaCM. Structure 21
(10), 1735-1742.
2. D.A. Case, T.A. Darden, T.E. Cheatham, III, C.L. Simmerling, J. Wang, R.E. Duke, R. Luo,
R.C. Walker, W. Zhang, K.M. Merz, B. Roberts, S. Hayik, A. Roitberg, G. Seabra, J. Swails,
A.W. Götz, I. Kolossváry, K.F. Wong, F. Paesani, J. Vanicek, R.M. Wolf, J. Liu, X. Wu, S.R.
Brozell, T. Steinbrecher, H. Gohlke, Q. Cai, X. Ye, J. Wang, M.-J. Hsieh, G. Cui, D.R. Roe,
D.H. Mathews, M.G. Seetin, R. Salomon-Ferrer, C. Sagui, V. Babin, T. Luchko, S. Gusarov,
A. Kovalenko, and P.A. Kollman (2012), AMBER 12, University of California, San
Francisco.
3. Park, H., DiMaio, F., & Baker, D. (2015). The origin of consistent protein structure
refinement from structural averaging. Structure, 23(6), 1123-1128.

BAKER-ROSETTASERVER

Improving Robetta using meta-genome sequences, iterative refinement, and ProQ2

S. Ovchinnikov1, H. Park1, D.E. Kim1,2, and D. Baker1,2


1 - Department of Biochemistry and Institute for Protein Design, University of Washington, WA, USA; 2 - Howard
Hughes Medical Institute
dabaker@uw.edu

Robetta1 (http://robetta.bakerlab.org) is a fully automated structure prediction server that is
continually benchmarked through the structure prediction evaluation project, CAMEO
(http://www.cameo3d.org). Motivated by the successful manual prediction of the 256 residue
free modeling target, T0806, in CASP11, an improved automated implementation was used for
CASP12. It uses meta-genome sequences2 to predict co-evolving residue-residue contacts with
GREMLIN3; a PDB template database supplemented with models of Pfam4 domains that are not
represented in the PDB but have enough co-evolutionary sequence data for accurate structures;
an iterative hybridization refinement protocol that uses an improved Rosetta energy function5;
and the model quality assessment program ProQ26 for final model ranking. With these
improvements, the automated process achieved modeling accuracy for T0806 similar to that of
the successful manual predictions from CASP11.

Methods

Domain boundary prediction. Domain boundaries are predicted by identifying PDB templates
with optimal sequence similarity and structural coverage to the target through an iterative
process. For each iteration, we use locally installed programs, HHSearch7, Sparks8, and
RaptorX9, to identify templates and generate alignments. For CASP12, we supplemented the
HHSearch template database with models of 717 Pfam domains not represented in the PDB but
with enough meta-genome sequences for accurate structure determination using co-evolution
data. The target sequence is threaded onto the template structures to generate partial-threaded
models, which are then clustered to identify distinct topologies that are ranked based on the
likelihood of the alignments. Regions of the target sequence that are not covered by the partial-
threads or are not similar in structure within the top ranked cluster are passed on to the next
search iteration. Through this iterative process, non-overlapping clusters are identified that,
together, cover the full length of the target sequence and domain boundaries are assigned at the
transitions between the clusters. The modeling difficulty of each domain is estimated by the
degree of structural consensus between the top ranked partial threads from each alignment
method.
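The bookkeeping behind this iterative parsing can be sketched as follows: after each search, the regions of the target not covered by the accepted cluster are collected for the next iteration, and once the clusters tile the sequence, boundaries are placed where one cluster ends and the next begins. This is a simplified illustration only. It ignores the structural-similarity criterion, assumes each cluster covers one contiguous 1-based inclusive residue range, and introduces a hypothetical `min_len` cutoff for regions worth searching again; none of these details come from the abstract.

```python
from typing import List, Tuple

Region = Tuple[int, int]  # (first residue, last residue), 1-based and inclusive


def uncovered_regions(length: int, covered: List[Region], min_len: int = 1) -> List[Region]:
    """Residue ranges not covered by any region, for the next search iteration."""
    gaps: List[Region] = []
    position = 1  # first residue not yet known to be covered
    for start, end in sorted(covered):
        if start > position:
            gaps.append((position, start - 1))
        position = max(position, end + 1)
    if position <= length:
        gaps.append((position, length))
    return [(s, e) for s, e in gaps if e - s + 1 >= min_len]


def domain_boundaries(clusters: List[Region]) -> List[int]:
    """Last residue of every cluster except the final one, in sequence order."""
    ordered = sorted(clusters)
    return [end for _, end in ordered[:-1]]


# Hypothetical 300-residue target whose first search covers residues 20 to 150.
print(uncovered_regions(300, [(20, 150)]))              # → [(1, 19), (151, 300)]
print(uncovered_regions(300, [(20, 150)], min_len=30))  # → [(151, 300)]
print(domain_boundaries([(151, 300), (1, 150)]))        # → [150]
```

The short N-terminal stretch in the example is dropped by the length cutoff, so only residues 151 to 300 would be searched again.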

Structure modeling. For each predicted domain, models are generated using our comparative
modeling protocol, RosettaCM10, which recombines structural elements from the clustered
partial-threads and models missing segments using a combination of fragment insertion and
mixed torsion-Cartesian space minimization. Conformational sampling is performed using the
Rosetta low-resolution score function11 with spatial restraints that are generated separately from
each cluster12. If enough co-evolutionary sequence data exist to accurately predict
residue-residue contacts, the clusters are re-ranked using this information, and the spatial
restraints are supplemented with the predicted contacts. For difficult domains, models are also
generated using
the Rosetta fragment assembly methodology11 (RosettaAB), and if GREMLIN contacts are
predicted, they are used as restraints for sampling and refinement. Large scale sampling of
around 10,000 - 300,000 models is achieved using the distributed computing project,
Rosetta@home (http://boinc.bakerlab.org/). All models are refined using a relax protocol13 that
minimizes an improved version of the Rosetta full-atom energy in torsion and Cartesian space to
allow bond angle flexibility. For difficult domains of fewer than 300 residues, an iterative
hybridization method was used for further refinement; it takes RosettaCM and RosettaAB
models as input and outputs a single refined model (see the BAKER refinement abstract for details
of iterative hybridization). RosettaCM models are selected by clustering the 100 best-scoring
models from each topologically distinct alignment cluster, and then averaging the models within
each cluster and refining the final averaged models. RosettaAB models are selected by clustering
the top scoring 10,000 models with 2,000 models of up to two sequence homologs. The
RosettaCM and RosettaAB cluster representatives and the iterative hybridization model are
ranked using ProQ2 for the final selected models.

Domain Assembly. For multi-domain targets, the final models for each domain are assembled
into a full-length model using Rosetta’s domain assembly method14, which assembles domains
using fragment insertion within the linkers.

Availability
Robetta is available for non-commercial use at http://robetta.bakerlab.org. The Rosetta software
suite can be downloaded from http://www.rosettacommons.org. GREMLIN is available for non-
commercial use at http://gremlin.bakerlab.org.

Acknowledgements
We thank Jinbo Xu, Johannes Söding and Yaoqi Zhou for making their excellent software available.

1. Kim,D.E., Chivian,D., & Baker,D. (2004). Protein structure prediction and analysis using the
Robetta server. Nucleic Acids Res 32, W526-W531.
2. Ovchinnikov S., Varghese N., Park H., Huang P., Pavlopoulos G.A., Kim D. E., Kamisetty
H., Kyrpides N. C., Baker D. Protein Structure Determination using Metagenome sequence
data. In revision.
3. Kamisetty,H., Ovchinnikov,S., Baker,D., Assessing the utility of coevolution-based residue-
residue contact predictions in a sequence- and structure-rich era. Proc Natl Acad Sci U S A
110, 15674-15679 (2013).
4. Finn,R.D., Coggill,P., Eberhardt,R.Y., Eddy,S.R., Mistry, J., Mitchell,A.L., Potter,S.C., Punta,
M., Qureshi, M., Sangrador-Vegas, A., Salazar, G.A., Tate, J., Bateman,A., The Pfam protein
families database: towards a more sustainable future, Nucleic Acids Research 44:D279-D285
(2016)
5. Park,H., Bradley,P., Greisen Jr.,P., Liu,Y., Mulligan,V.K., Kim,D.E., Baker,D., DiMaio,F..
Simultaneous optimization of biomolecular energy function on features from small molecules
and macromolecules, submitted.
6. Uziela,K, Wallner,B. (2016). ProQ2: estimation of model accuracy implemented in Rosetta.
Bioinformatics. 32(9), 1411-3
7. Söding,J. (2005). Protein homology detection by HMM-HMM comparison. Bioinformatics
21 (7), 951-960.
8. Yang,Y., Faraggi,E., Zhao,H. & Zhou,Y. (2011). Improving protein fold recognition and
template-based modeling by employing probabilistic-based matching between predicted one-
dimensional structural properties of the query and corresponding native properties of
templates. Bioinformatics 27 (15), 2076-2082.
9. Peng,J. & Xu,J. (2011). RaptorX: Exploiting structure information for protein alignment by
statistical inference. Proteins 79, 161-171.
10. Song,Y. et al. (2013). High-resolution comparative modeling with RosettaCM. Structure 21
(10), 1735-1742.
11. Leaver-Fay,A. et al. (2010). ROSETTA3.0: An Object-Oriented Software Suite for the
Simulation and Design of Macromolecules. Methods in Enzymology 487, 545-574.
12. Thompson,J. & Baker,D. (2011). Incorporation of evolutionary information into Rosetta
comparative modeling. Proteins 79 (8), 2380-2388.
13. Conway,P. et al. (2014). Relaxation of backbone bond geometry improves protein energy
landscape modeling. Protein Sci. 23 (1), 47-55.
14. Wollacott,A., Zanghellini,A., Murphy,P., & Baker,D. (2007). Prediction of structures of
multidomain proteins from structures of the individual domains. Protein Sci. 16 (2), 165-175.

Bates_BMM

Protein model construction, optimization and docking using particle swarm optimization

R.A.G. Chaleil1, E. Pfeiffenberger1 and P.A. Bates1


1 – Biomolecular Modelling Laboratory, The Francis Crick Institute, 1 Midland Road, London NW1 1AT, UK
raphael.chaleil@crick.ac.uk

The construction, optimization and docking of protein models remain challenging. All require
extensive sampling of a high-dimensional conformational space, which is intractable with
methods based on exhaustive enumeration of all possible solutions. To address this
problem, we have developed a series of heuristic methods based on Particle Swarm Optimization
(PSO) that alleviate the problem and generate high-accuracy solutions.

Methods
Our general methodology for fold construction and docking can be described as follows:

1. Fold construction using SwarmLoop


We first search for sequences homologous to the query sequence using HHBlits1 against a
sequence profile database of known structures clustered at 70% sequence identity. We then build
a linear ab initio polypeptide corresponding to the query sequence and set its bond lengths,
angles and torsion angles according to the identified homologous fragments. All coil regions
that are not matched with a structural template are adjusted in torsion angle space with the
SwarmLoop algorithm. This algorithm is a constricted PSO2 that searches for a minimum of the
Dfire3 statistical pair potential energy. When distance information was available, either from
PSICOV4 or from discontinuous templates, a Hookean force was applied as a distance restraint.
We explored two strategies for folding the structure: the first adjusts all the
torsion angles between all the fragments at once, whereas the second adjusts the torsion angles of
each linker region (i.e. the region between fragments from templates) one at a time, starting from
the N-terminus. The latter technique is computationally more expensive, but it
generates structures with a smaller radius of gyration (i.e. the structures are more globular),
which yields better, i.e. biophysically sound, models. Finally, SwarmLoop is run in 100 replicates
of 10000 iterations each; the 10 top-ranking models according to Dfire are
minimized with CHARMM5 version 22, and the structure with the best CHARMM energy after
minimization is selected for submission.

2. Docking using SwarmDock
For all predicted homo-oligomeric structures we used a modification to our binary protein-
docking algorithm SwarmDock6. Our method uses the principles of PSO to search the parameter
docking space. The innovation with the new algorithm is to treat each particle within the swarm
as an instance of a packed homo-oligomer, constrained by the appropriate symmetry operators.
The objective is to optimize the particle space in order to find the most energetically favorable
homo-oligomer. Particles move through a multi-parameter space by optimizing two sets
of parameters: the orientations and translations of the monomeric units relative to the imposed
symmetry, and a linear combination of normal modes that adjusts the conformation of each
monomer in the presence of the other monomers during this simultaneous docking process. For
hetero-oligomeric structures we employed our standard SwarmDock protocol6. The monomeric
models used for docking were selected from the CASP12 server tarballs.
Furthermore, for the CASP targets that were also labelled as CAPRI targets, the best-scoring
SwarmDock model was optimized by applying a novel physics-only refinement method.
Essentially, the method works in the contact map space (CMS) of observed residue-residue contacts
at the interface between receptor and ligand. This CMS is used as a collective variable (CV) in a
metadynamics simulation for enhanced sampling and to reconstruct the conformational free
energy landscape that guides the selection of optimized models. This enables us to
refine not only dimeric protein-protein complexes but also multimeric complexes, where
each protein-protein interface is represented by its own contact map.
Overall, this unique approach focuses enhanced sampling on our variable
of interest (thereby reducing computational cost) and makes use of the information available from the
different observed contacts to guide the search for new energy minima.
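
Both SwarmLoop and SwarmDock build on PSO. As an illustration of the optimizer only, the following minimal sketch implements a constricted PSO on a toy objective. The constriction coefficient 0.7298 and acceleration constants 2.05 are commonly used literature values assumed here, not values taken from this abstract, and `toy_energy` is a hypothetical stand-in for the Dfire energy plus Hookean restraints.

```python
import math
import random
from typing import Callable, List, Tuple


def constricted_pso(objective: Callable[[List[float]], float], dim: int,
                    n_particles: int = 20, n_iter: int = 300,
                    lo: float = -math.pi, hi: float = math.pi,
                    seed: int = 0) -> Tuple[List[float], float]:
    """Minimize `objective` over [lo, hi]^dim with a constricted PSO."""
    rng = random.Random(seed)
    chi, c1, c2 = 0.7298, 2.05, 2.05  # assumed, commonly used constants
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]
    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # The constriction factor chi scales the whole velocity update
                vel[i][d] = chi * (vel[i][d]
                                   + c1 * r1 * (pbest[i][d] - pos[i][d])
                                   + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val


TARGET = [1.0, -0.5, 2.0]  # hypothetical "ideal" torsion angles in radians


def toy_energy(angles: List[float]) -> float:
    """Harmonic toy energy; stands in for a real potential."""
    return sum((a - t) ** 2 for a, t in zip(angles, TARGET))


best_angles, best_energy = constricted_pso(toy_energy, dim=3)
print(best_angles, best_energy)
```

In an application such as loop building, each particle position would hold the torsion angles of the coil regions and the objective would evaluate the rebuilt structure.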

Availability
Manuscript for SwarmLoop and refinement is in preparation.

1. Remmert M., Biegert A., Hauser A. & Söding J. (2011). HHblits: Lightning-fast iterative
protein sequence searching by HMM-HMM alignment. Nat. Methods. 9(2), 173-5.
2. Eberhart, R. C. & Kennedy, J. (1995). A new optimizer using particle swarm theory. In
Proceedings of the sixth international symposium on micro machine and human science (pp.
39–43), Nagoya, Japan. Piscataway: IEEE.
3. Y. Yang & Y. Zhou. (2008). Specific interactions for ab initio folding of protein terminal
regions with secondary structures. Proteins 72, 793-803.
4. Jones DT, Buchan DW, Cozzetto D & Pontil M. (2012). PSICOV: precise structural contact
prediction using sparse inverse covariance estimation on large multiple sequence alignments.
Bioinformatics. 28(2), 184-90.
5. Brooks BR, Brooks CL 3rd, Mackerell AD Jr, Nilsson L, Petrella RJ, Roux B, Won Y,
Archontis G, Bartels C, Boresch S, Caflisch A, Caves L, Cui Q, Dinner AR, Feig M, Fischer
S, Gao J, Hodoscek M, Im W, Kuczera K, Lazaridis T, Ma J, Ovchinnikov V, Paci E, Pastor
RW, Post CB, Pu JZ, Schaefer M, Tidor B, Venable RM, Woodcock HL, Wu X, Yang W,
York DM & Karplus M. (2009). CHARMM: the biomolecular simulation program. J.
Comput. Chem. 30(10), 1545-614.
6. Torchala M., Moal I.H., Chaleil R.A.G., Fernandez-Recio J. & Bates P.A. (2013).
SwarmDock: a server for flexible protein-protein docking. Bioinformatics. 29(6), 807-9.
7. Tribello, G. A., Bonomi, M., et al. (2014). PLUMED 2: New feathers for an old bird.
Computer Physics Communications, 185(2), 604-613.

Bates_BMM (refinement)

Physics-only based refinement of predicted protein folds in contact-map space

E. Pfeiffenberger1 and P.A. Bates1


1 – Biomolecular Modelling Laboratory, The Francis Crick Institute, 1 Midland Road, London NW1 1AT, UK
erik.pfeiffenberger@crick.ac.uk

The refinement of protein structures remains challenging. Physics-only methods based on
molecular dynamics simulations have achieved some degree of success1. However, their
sampling power is usually restricted by energetic barriers that confine the exploration of
conformational states to local minima; overcoming these barriers requires long simulation
times, making such methods computationally very expensive.
In order to address this problem we present a novel refinement method to improve the
accuracy of predicted protein structures. Essentially, the method makes use of the structural
variation present in the set of submitted predictions to infer restraints for conserved regions with
low variability and to construct a contact map space (CMS) of observed residue-residue contacts
in folds. This CMS is used as a collective variable (CV) in a metadynamics simulation for
enhanced sampling and to reconstruct the conformational free energy landscape to guide the
model selection.
This unique approach focuses enhanced sampling on our variable of
interest (thereby reducing computational cost) and makes use of the information available from the
different conformational states to guide the search for new energy minima.

Methods
The method can be divided into five parts that can be described as follows:
i) Filtering of all available models
All available models from participating predictors of a target are downloaded from the
prediction center server. Each model is compared to the starting model and the Cα root
mean square deviation (RMSD) is calculated; models with an RMSD > 10 Å are removed
from the set.
ii) Deriving position restraints
The filtered set is used to determine structurally conserved residues. These are identified
by computing the per-residue root mean square fluctuation (RMSF) of Cα atoms.
Residues with an RMSF < 3 Å are considered conserved and their movements are restrained
during the sampling process.
iii) Contact map generation and collective variable definition
In the structures of the filtered set of CASP predictions, residue-residue contacts are
identified as residue pairs with a Cα or Cβ distance below 8 Å, with the exception of direct
neighbors, which are removed from the list. From these contacts two contact maps (CM) are
generated, namely CMexclusive and CMmin. CMexclusive contains contacts that are exclusive
to one model from the filtered set, whereas the map CMmin contains contacts with the
lowest Cα/Cβ distance. From these CMs we can define two CVs describing the CMS:
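\[ CV_1(R) = \frac{1}{N} \sum_{\gamma \in CM_{\mathrm{exclusive}}} \left( D_\gamma(R) - D_\gamma(R_{\mathrm{ref}}) \right)^2 \tag{1} \]
\[ CV_2(R) = \frac{1}{N} \sum_{\gamma \in CM_{\mathrm{min}}} \left( D_\gamma(R) - D_\gamma(R_{\mathrm{ref}}) \right)^2 \tag{2} \]
\[ D_\gamma(R) = \frac{1 - (r_\gamma / r_{\gamma 0})^n}{1 - (r_\gamma / r_{\gamma 0})^m} \tag{3} \]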
The sigmoidal distance function Dγ(R) is used to quantify the formation of a contact γ in
structure R, where 𝑟𝛾 is the contact distance in structure R and 𝑟𝛾0 is the contact distance
in the reference structure Rref, which denotes one of the models from the filtered set of
CASP12 models in which the contact was observed. The exponents n and m are constants, set
to n=6 and m=10.
iv) Energy minimization, equilibration and sampling
The preparation of the starting model prior to the sampling process follows a standard
GROMACS2 procedure in which the system is solvated, energy minimized and equilibrated for
300 ps.
The sampling with metadynamics in CMS is performed at 300 K for 10 ns with 5 replicas
for each CM definition, resulting in 100 ns of sampling data for each target. The sampling of the
CMS was performed with the GROMACS plug-in PLUMED3, where a Gaussian is
deposited every 2 ps with σ=0.5, a bias factor of 10 and an initial height of 5 kJ/mol.
v) Scoring and model building
Snapshots from the trajectories are taken every 10 ps, resulting in 9810 frames in total.
Scoring of these frames is based on reconstructing the free energy surface (FES) by
integrating the bias deposited during the simulation and on computing the DFIRE4 energy.
Furthermore, we constructed the combined scoring function CSα, which uses the normalized
energies from both FES and DFIRE to score frames: CSα = (1-α)FESN + αDFIREN, with
α=0.5 so that both terms contribute equally to the score.
Overall, 4 models were generated, each based on the average structure of
the 20 best-scoring frames from CMmin or CMexclusive according to DFIRE or CSα (i.e. DFIRE20
and CSα20). Each averaged model was subjected to a two-step steepest descent minimization
procedure: a minimization of the structure in vacuum followed by a
minimization in explicit solvent, with 50000 steps in each minimization.
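
To make parts iii) and v) concrete, the following minimal sketch evaluates the sigmoidal function of Eq. (3), a contact-map CV of the form of Eqs. (1) and (2), and the combined score CSα. It is an illustration under stated assumptions and not the PLUMED implementation: N is taken to be the number of contacts in the map, the reference distance of each contact is passed in a dictionary, and min-max normalization is assumed for FESN and DFIREN because the normalization is not specified above. All function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Coord = Tuple[float, float, float]
Contact = Tuple[int, int]


def switching(r: float, r0: float, n: int = 6, m: int = 10) -> float:
    """Eq. (3): D = (1 - (r/r0)^n) / (1 - (r/r0)^m)."""
    x = r / r0
    if abs(x - 1.0) < 1e-9:
        return n / m  # limit of the ratio as r approaches r0
    return (1.0 - x ** n) / (1.0 - x ** m)


def contact_cv(coords: Sequence[Coord], ref_dist: Dict[Contact, float],
               n: int = 6, m: int = 10) -> float:
    """Eqs. (1)/(2): mean squared difference between D(R) and D(R_ref).

    ref_dist maps each contact (i, j) to its distance in the reference
    model in which it was observed (assumed input format).
    """
    total = 0.0
    for (i, j), r0 in ref_dist.items():
        r = math.dist(coords[i], coords[j])
        total += (switching(r, r0, n, m) - switching(r0, r0, n, m)) ** 2
    return total / len(ref_dist)  # N assumed to be the number of contacts


def min_max(values: Sequence[float]) -> List[float]:
    """Scale values to [0, 1]; an assumed normalization."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def combined_score(fes: Sequence[float], dfire: Sequence[float],
                   alpha: float = 0.5) -> List[float]:
    """CS_alpha = (1 - alpha) * FES_N + alpha * DFIRE_N for each frame."""
    return [(1.0 - alpha) * f + alpha * d
            for f, d in zip(min_max(fes), min_max(dfire))]


# Toy structure with three atoms and one contact between atoms 0 and 2
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (6.0, 0.0, 0.0)]
print(contact_cv(coords, {(0, 2): 6.0}))
print(combined_score([0.0, 10.0], [5.0, 15.0]))
```

With α=0.5 the frame ranking depends equally on both normalized terms; frames with the lowest combined score would be the ones averaged.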

Results
Models based on this approach were submitted to the CASP12 refinement section where they
have the model ids 2-5: (2) CMmin+CSα20, (3) CMmin+DFIRE20, (4) CMexclusive+CSα20 and (5)
CMexclusive+DFIRE20. Model 1 in the submissions is based on the method established in CASP11.
The method was applied to all single chain refinement targets.

1. Modi, V., & Dunbrack, R. L. (2016). Assessment of refinement of template‐based models in
CASP11. Proteins: Structure, Function, and Bioinformatics.
2. Pronk, S., Pall, S., et al. (2013). GROMACS 4.5: a high-throughput and highly parallel open
source molecular simulation toolkit. Bioinformatics
3. Tribello, G. A., Bonomi, M., et al. (2014). PLUMED 2: New feathers for an old bird.
Computer Physics Communications, 185(2), 604-613.
4. Yang, Y., & Zhou, Y. (2008). Specific interactions for ab initio folding of protein terminal
regions with secondary structures. Proteins: Structure, Function, and Bioinformatics, 72(2),
793-803.

BhageerathH-Plus

BhageerathH+: A hybrid methodology based software suite for protein tertiary structure
prediction

Rahul Kaushik1,2, Ankita Singh1, Debarati DasGupta1,3, Amita Pathak1,3, Shashank Shekhar1 and
B. Jayaram*1,2,3
1- Supercomputing Facility for Bioinformatics & Computational Biology, 2- Kusuma School of Biological
Sciences,3- Department of Chemistry, Indian Institute of Technology, Hauz Khas, New Delhi (India) – 110016
rahul@scfbio-iitd.res.in, bjayaram@chemistry.iitd.ac.in

The continuously widening gap between known protein sequences and experimentally solved
structures, the need for structures of novel protein drug targets to enable structure-based drug
discovery strategies against the increasing threat of diseases and disorders, and the quest for a
better understanding of the function of proteins involved in diverse cellular metabolic pathways
have raised the urgency of developing highly accurate computational structure prediction
approaches1. In numerous cases, owing to various methodological limitations, purely homology-based
or purely ab initio approaches fail to deliver the desired results in structure prediction,
and this has led to the emergence of hybrid approaches that integrate both.
Continuous methodological improvements in structure prediction over the past two decades have
been chronicled via the CASP experiments2. The success of hybrid approaches in previous CASP
experiments, as implemented in some of the state-of-the-art methods, has enabled the scientific
community to explore experimentally unsolved proteins more efficiently at the structural
level.
In the recently concluded CASP experiment, CASP12, we tested the BhageerathH+
software suite, which implements a hybrid approach by integrating several methods developed
in house with certain other tools and delivers a reliable structure for a protein
from sequence information.

Methods
BhageerathH+ is an advanced version of Bhageerath-H3 and comprises three major
steps: structure generation for conformational sampling, structure scoring for selecting the
best conformations, and structure refinement with side chain optimization for quality
improvement. The structure generation module integrates the recently developed RM2TS4 and
RM2TS+ methodologies and New Chemical Logic of amino acid based alignment and structure
generation with the previously developed StrGen algorithm5 and Bhageerath6. The sampled
conformations are clustered to filter out similar topologies. After clustering, conformations
representing mutually exclusive topologies are scored and ranked with an improved version of
ProTSAV7 (ProTSAV+) to select the top 50 conformations. The selected conformations are
further processed with molecular dynamics (MD) based refinement and side chain optimization
to enhance their structural quality. These MD-treated conformations are scored and re-ranked
with ProTSAV+ to select the top 5 conformations, which are mailed to the user at the email
address provided.
The BhageerathH+ software suite can support, directly or indirectly, various applications such as
protein function characterization and annotation, ligand binding site directed drug design, etc.
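
The two-stage selection described above (rank, keep the top 50, refine, re-rank, keep the top 5) can be sketched generically as follows. The function `select_refine_rerank` and the toy `score` and `refine` callables are hypothetical placeholders, not BhageerathH+ or ProTSAV+ code, and the convention that a lower score is better is an assumption.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def select_refine_rerank(conformations: Sequence[T],
                         score: Callable[[T], float],
                         refine: Callable[[T], T],
                         n_first: int = 50,
                         n_final: int = 5) -> List[T]:
    """Rank, shortlist, refine the shortlist, then re-rank it."""
    # Assumption: a lower score means a better conformation
    shortlisted = sorted(conformations, key=score)[:n_first]
    refined = [refine(c) for c in shortlisted]
    return sorted(refined, key=score)[:n_final]


# Toy stand-ins: a "conformation" is a number, its score is its magnitude,
# and "refinement" halves it.
toy_conformations = [float(x) for x in range(1, 101)]
final = select_refine_rerank(toy_conformations, score=abs,
                             refine=lambda c: 0.5 * c)
print(final)
```

Only the shortlist is refined, which keeps the expensive MD step limited to a fixed number of candidates regardless of how many conformations were sampled.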

Results
The methodology outlined above was automated and fielded in the recently concluded CASP12
experiment under the TS category as the BhageerathH+ server. The software suite has performed
reasonably well so far on targets whose native structures have been released in the PDB. For instance,
BhageerathH+ predicted 12 of the 20 targets whose native structures have been
released with RMSDs under 5 Å, which is as good as the performance of any other
participating server. The average GDT score of the BhageerathH+ predictions for these targets is
42.8, while the best average GDT score achieved by any participating server thus far is 48.5.
Most of the modules implemented in the BhageerathH+ software suite are freely available in the
public domain while others are in the pipeline.

Availability
BhageerathH+ software suite can be freely accessed at http://scfbio-iitd.res.in/BhageerathH+.

1. Hajduk, P. J. and Greer, J. (2007). A decade of fragment-based drug design: strategic
advances and lessons learned. Nature Reviews Drug Discovery 6, 211-219.
2. Moult, J., Fidelis, K., Kryshtafovych, A., Schwede, T. and Tramontano, A. (2014). Critical
assessment of methods of protein structure prediction (CASP) — round x. Proteins 82, 1–6.
3. Jayaram, B., Dhingra, P., Mishra, A., Kaushik, R., Mukherjee, G., Singh, A. and Shekhar, S.
(2014). Bhageerath-H: A homology/ab initio hybrid server for predicting tertiary structures of
monomeric soluble proteins. BMC Bioinformatics 15(16), S7.
4. DasGupta, D., Kaushik, R. and Jayaram, B. (2015). From Ramachandran Maps to Tertiary
Structures of Proteins. J. Phys. Chem. B 119 (34), 11136-11145.
5. Dhingra, P. and Jayaram, B. (2013). A homology/ab initio hybrid algorithm for sampling
near-native protein conformations. J Comput. Chem. 34, 1925-36.
6. Jayaram, B., Bhushan, K., Shenoy, S. R., Narang, P., Bose, S., Agrawal, P., Sahu, D. and
Pandey, V. (2006). Bhageerath: an energy based web enabled computer software suite for
limiting the search space of tertiary structures of small globular proteins. Nucleic Acids Res.
34 (21), 6195-6204.
7. Singh, A., Kaushik, R., Mishra, A., Shanker, A. and Jayaram, B. (2015). ProTSAV: A Protein
Tertiary Structure Analysis and Validation Server. BBA - Proteins and Proteomics 1864(1),
11-19.

CAMD-BIR_Cuba

Ab initio Structure Prediction Leveraging the Entropic Contribution of the Hydrophobic
Effect in an Empirical Free-Energy Function

Y. B. Ruiz-Blanco1, E. Martínez-Pérez1,2, Z. Abdelazez3 and J. R. Green3


1 – Departamento de Farmacia. Universidad Central “Marta Abreu” de Las Villas, 2 - Laboratorio de
Bioinformática, Fundación Instituto Leloir, 3 – Department of Systems and Computing Engineering. Carleton
University.
yasserrb@uclv.edu.cu; yasserblanco@sce.carleton.ca

CASP events were originally created “to determine the status of current methods for predicting
the three-dimensional structure of proteins”1. Specifically, ab initio structure prediction gathers
methods that generate a structure for a given protein without using structural information from
close homologs. Nonetheless, many ab initio methods have indirectly taken advantage of
partial homology of peptide fragments or of remote similarity to known structures. Notably,
as early as CASP4 it was highlighted that “the ab initio heading now includes methods that make
significant reference to database information... However, it should not blind us to the intrinsic
importance of being able to predict a fold only on the basis of the single sequence in front of
us”2. This ultimate aim depends on finding a proper free-energy function to perform an effective
exploration of the conformational space. Although many developments have been made in this
regard, most approaches focus on mimicking classic force field formulations using coarse-
grained representations, with the UNRES potential being one of the most relevant examples3.
Here we use a physics-based scoring function previously introduced by Ruiz-Blanco et
al.4,5, which originates from a reductionist view of the folding process, where proteins behave as
elastic springs. Here, compression (folding) is the result of the work of an external force
represented by the hydrophobic effect6. Under such a scenario, the native state (or its basin in the
free energy landscape) is reached when, upon folding, the entropy gain of the surrounding water
molecules can no longer exceed the corresponding loss of conformational entropy. The resulting
close-packing interactions are modelled by enthalpic contributions associated with electrostatic
and Van der Waals interactions, a penalty for uncovered valences of hydrogen bonds in internal
residues, and a backbone torsion potential.

Methods
The scoring potential leverages a novel implicit-solvent approach to the entropy gain of the first
shell of solvation upon folding. This entropy factor represents the dominant component of the
scoring function, while close-packing interactions largely impact local conformations with little
effect on global fold.
The optimization is conducted using a genetic algorithm whose initial population is
seeded with randomly generated all-α and all-β structures. The best solution is selected according
to the physics-based potential; however, diversity in intermediate populations is introduced by
selecting individuals according to three criteria: i) the scoring potential, ii) an ad hoc function that
favors the prevalence of structures with β-strands and the formation of hydrogen bonds, and iii)
the quotient of the scoring function and the accessible surface area, in order to favor highly
compact conformations.

The final five candidates are selected by clustering the distinct solutions from several
runs using a simple k-means approach based on structural descriptors computed with the
program ProtDCal7. No refinement or manual modifications were applied to the final
candidates.
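
The three-criteria selection used to keep intermediate populations diverse can be illustrated with a minimal sketch. This is not the authors' Java implementation: the class `Individual`, the function `select_diverse`, the union-of-top-k strategy and the sign conventions (lower scoring potential is better, higher β-strand/hydrogen-bond value is better, and scores are negative so that a lower score/ASA quotient favors compact structures) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Individual:
    name: str
    score: float       # scoring potential; lower assumed better
    beta_hbond: float  # ad hoc beta-strand/H-bond function; higher assumed better
    asa: float         # accessible surface area, must be positive


def select_diverse(population: List[Individual], k: int) -> List[Individual]:
    """Keep the union of the top-k individuals under each of three criteria."""
    by_score = sorted(population, key=lambda ind: ind.score)[:k]
    by_beta = sorted(population, key=lambda ind: -ind.beta_hbond)[:k]
    by_quotient = sorted(population, key=lambda ind: ind.score / ind.asa)[:k]
    selected: List[Individual] = []
    for ind in by_score + by_beta + by_quotient:
        if ind not in selected:  # avoid duplicates across criteria
            selected.append(ind)
    return selected


toy_population = [
    Individual("A", score=-10.0, beta_hbond=0.1, asa=100.0),
    Individual("B", score=-5.0, beta_hbond=0.9, asa=120.0),
    Individual("C", score=-8.0, beta_hbond=0.2, asa=40.0),
    Individual("D", score=-1.0, beta_hbond=0.3, asa=90.0),
]
print([ind.name for ind in select_diverse(toy_population, k=1)])
```

In the toy population each criterion favors a different individual, so the survivors are more varied than a selection on the scoring potential alone would give.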

Availability
The Java implementations and tutorials for the conformational search algorithm and the program
ProtDCal for the calculation of descriptors, can be freely downloaded from:
http://bioinf.sce.carleton.ca/ProtDCal/download.html.

1. Defay, T. & Cohen, F. E. (1995). Evaluation of Current Techniques for Ab Initio Protein
Structure Prediction. PROTEINS: Structure, Function, and Genetics 23, 431-445.
2. Lattman, E. E. (2001). CASP4 editorial. Proteins: Structure, Function, and Bioinformatics
45, 1-1.
3. Rojas, A. V., Liwo, A. & Scheraga, H. A. (2007). Molecular Dynamics with the United-
Residue (UNRES) Force Field. Ab initio Folding Simulations of Multi-chain Proteins. The
journal of physical chemistry. B 111, 293-309.
4. Ruiz-Blanco, Y. B., Marrero-Ponce, Y., García, Y., Puris, A., Bello, R., Green, J. &
Sotomayor-Torres, C. M. (2014). A physics-based scoring function for protein structural
decoys: Dynamic testing on targets of CASP-ROLL. Chemical Physics Letters 610–611,
135–140.
5. Ruiz-Blanco, Y. B., Marrero-Ponce, Y., Paz, W., García, Y. & Salgado, J. (2013). Global
Stability of Protein Folding from an Empirical Free Energy Function. Journal of
Theoretical Biology 321, 44-53.
6. Ruiz-Blanco, Y. B., Marrero-Ponce, Y., Prieto, P. J., Salgado, J., García, Y. & Sotomayor-
Torres, C. M. (2015). A Hooke's law-based approach to protein folding rate. Journal of
Theoretical Biology 364, 407-417.
7. Ruiz-Blanco, Y. B., Paz, W., Green, J. & Marrero-Ponce, Y. (2015). ProtDCal: A program
to compute general-purpose-numerical descriptors for sequences and 3D-structures of
proteins. BMC Bioinformatics 16, 162.

chuo-u

3-Dimensional models created by chuo-u team using the FAMS program, statistical
potentials and some inspection

T. Ichikawa, M. Iwadate and H. Umeyama


Department of Biological Sciences, Chuo University
a11.jxcp@g.chuo-u.ac.jp

Our human team, "chuo-u", participated in the TS (3D atomic coordinate prediction) category of CASP12.
We used FAMS[1] as the homology modeling program, together with PSI-BLAST and RaptorX[2]. We
used CIRCLE[3] to calculate statistical potentials that consist of secondary structure and
hydrophobic interaction terms for each amino acid residue. We assume that the CIRCLE
program gives an approximation of the Gibbs free energy.

Methods
This team is a human team. All processes were performed with stand-alone software that does
not use results from other CASP12 servers, and model selection and inspection were carried out
using the Isolated-FAMS system (our original system). Homology search and alignments were
performed with PSI-BLAST and RaptorX. Models were built with FAMS for all the search outputs.
Model selection was performed with the CIRCLE software. For refinement targets, the final main chain
refinement process of FAMS was iterated until the deadline.

Results
All regular targets and refinement targets were modeled. The accuracy of all the models was
calculated, and the results are shown at http://fams.bio.chuo-u.ac.jp/casp/casp12/. This page
continues to be checked against the PDB database to compare modeled structures with experimental structures.

Availability
A trial version of FAMS is available at http://fams.bio.chuo-u.ac.jp/fams/
FAMS and CIRCLE are commercial software. Please see http://www.pd-fams.com/

1. Umeyama H, Iwadate M., FAMS and FAMSBASE for Protein Structure. In Current
Protocols in Bioinformatics. John Wiley & Sons, Inc. (2002).
2. http://raptorx.uchicago.edu/
3. Terashi G, Takeda-Shitaka M, Kanou K, Iwadate M, Takaya D, Hosoi A, Ohta K,
Umeyama H., Proteins, 69, Sp.8, 98-107 (2007)

chuo-u-server

3-Dimensional models created by Chuo-fams-server team using the FAMS program and
statistical potentials

M. Iwadate, T. Ichikawa and H. Umeyama


Department of Biological Sciences, Chuo University
iwadate@bio.chuo-u.ac.jp

Our server team, "chuo-u-server", participated in the TS (3D atomic coordinate prediction) category of
CASP12. We used FAMS[1] as the homology modeling program, together with PSI-BLAST and
RaptorX[2]. We used CIRCLE[3] to calculate statistical potentials that consist of secondary
structure and hydrophobic interaction terms for each amino acid residue.

Methods
This team is a server team. All processes were performed with stand-alone software that does not
use results from other CASP12 servers. Homology search and alignments were performed with PSI-
BLAST and RaptorX. After those processes, all the 3-dimensional protein models were built with
FAMS. Lastly, protein model selection was performed with the CIRCLE software. We assume
that the CIRCLE program gives an approximation of the Gibbs free energy.
For refinement targets, the final main chain refinement process of FAMS was iterated until the
CASP12 deadline.

Results
All regular targets and refinement targets were modeled. The accuracy of all the models was
calculated, and the results are shown at http://fams.bio.chuo-u.ac.jp/casp/casp12/. This page
continues to be checked against the PDB database to compare modeled results with experimental structures.

Availability
A trial version of FAMS is available at http://fams.bio.chuo-u.ac.jp/fams/
FAMS and CIRCLE are commercial software. Please see http://www.pd-fams.com/

1. Umeyama H, Iwadate M., FAMS and FAMSBASE for Protein Structure. In Current
Protocols in Bioinformatics. John Wiley & Sons, Inc. (2002).
2. http://raptorx.uchicago.edu/
3. Terashi G, Takeda-Shitaka M, Kanou K, Iwadate M, Takaya D, Hosoi A, Ohta K,
Umeyama H., Proteins, 69, Sp.8, 98-107 (2007)

CPCLab

TopModel: A Multi-template Meta-approach to Homology Modelling

D. Mulnaes and H. Gohlke


Mathematisch-Naturwissenschaftliche Fakultät, Institut für Pharmazeutische und Medizinische Chemie, Heinrich-
Heine-Universität Düsseldorf, 40225 Düsseldorf, Germany
gohlke@hhu.de

Knowledge of a protein structure is essential to understand its function1, stability2, interactions3,
and for structure-based protein or drug design4. The current rate of experimental structure
determination, however, is far exceeded by that of next-generation sequencing, resulting in less
than 1/1000th of proteins having a known structure. Two decades of CASP experiments have
shown the value of consensus- and meta-methods that utilize complementary algorithms for
computational structure prediction to alleviate this problem5.

Methods
We present TopModel as a novel consensus approach to structure prediction. TopModel uses
eight state-of-the-art threading tools including the LOMETS threaders6 and HHSuite7 for
template-target alignment. To utilize multiple templates TopModel uses eight state-of-the-art
multiple alignment tools including the TCOFFEE8 suite and PROMALS3D9 for template-
template alignment. Model quality is assessed with the in-house method TopScore, which
predicts the global and residue-wise lDDT10 score of a model using a weighted, normalized
average of scores from eight state-of-the-art model quality assessment programs including
ModFoldClust211 and GOAP12. The predicted lDDT is used to weight alignments and calculate a
consensus multiple alignment based on which templates are ranked. All pairwise, multiple and
consensus alignments are modeled using a modified Modeller13 routine to eliminate knots and
impose secondary-structure specific restraints predicted by PSIPRED 14. The best model is
selected with TopScore and refined with ModRefiner15. At the start of CASP12, TopModel did
not have multi-domain or oligomer utilities but during the competition we started development
of domain prediction and docking tools. Early versions of these allowed for domain-wise
modeling of targets T0866, T0886, and T0920. For T0866 and T0886, domains were manually
placed with termini near each other, and for T0920 FRODOCK16 was used for domain docking
and Modeller for connecting termini. We started development of a tool for oligomer prediction
by searching the PDB for scaffolds upon which the models were superimposed. An early version
of this tool led to dimers T0870 and T0924.
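
The "weighted, normalized average" used by TopScore can be illustrated with a minimal sketch. The abstract does not state how the individual scores are normalized or how the weights are obtained, so the min-max normalization, the program names and all numbers below are assumptions chosen only for illustration.

```python
def weighted_normalized_average(scores, ranges, weights):
    """Combine raw QA scores into one value in [0, 1].

    scores:  {program: raw score}
    ranges:  {program: (worst, best)} used for min-max normalization (assumed)
    weights: {program: non-negative weight} (assumed to be fitted elsewhere)
    """
    total_weight = sum(weights[name] for name in scores)
    combined = 0.0
    for name, raw in scores.items():
        worst, best = ranges[name]
        # Map the raw score onto [0, 1]; worst > best handles energy-like scores.
        normalized = (raw - worst) / (best - worst)
        combined += weights[name] * normalized
    return combined / total_weight


# Hypothetical programs and numbers, for illustration only.
example = weighted_normalized_average(
    scores={"qa_a": 0.8, "qa_b": -30.0},
    ranges={"qa_a": (0.0, 1.0), "qa_b": (0.0, -100.0)},
    weights={"qa_a": 3.0, "qa_b": 1.0},
)
```

Passing the range as (worst, best) lets similarity-like scores and energy-like scores (where lower is better) share one code path.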

Results
Preliminary analysis of our performance revealed three pitfalls for our method of using model
quality for fold selection: 1. knots in the models result in bad scores for potentially correct
templates, 2. multi-domain models without a global template disfavor partial hits compared to
potentially false ones with larger coverage, 3. co-evolving chains without stable single-chain
models can lead to wrong template selection when chains are built individually. To address these
issues we implemented knot detection and started development of domain identification and
methods for domain docking and oligomer prediction.

Availability
Early versions of TopModel were applied in studies of the nisin operon17, feline APOBEC318,
and ectoin hydroxylase1.

1. Widderich, N., et al. J. Mol. Biol., 2014. 426(3): p. 586-600.
2. Rathi, P.C., H.W. Höffken, and H. Gohlke. J. Chem. Inf. Model., 2014. 54(2): p. 355-361.
3. Gohlke, H., et al. J. Chem. Inf. Model., 2013. 53(10): p. 2493-2498.
4. Aehle, W., et al. Journal of biotechnology, 1993. 28(1): p. 31-40.
5. Moult, J., et al. Proteins, 2014. 82(S2): p. 1-6.
6. Wu, S. and Y. Zhang. Nucleic Acids Res., 2007. 35(10): p. 3375-3382.
7. Remmert, M., et al. Nature methods, 2012. 9(2): p. 173-175.
8. O'Sullivan, O., et al. J. Mol. Biol., 2004. 340(2): p. 385-395.
9. Pei, J., B.-H. Kim, and N.V. Grishin. Nucleic Acids Res., 2008. 36(7): p. 2295-2300.
10. Mariani, V., et al. Bioinformatics, 2013. 29(21): p. 2722-2728.
11. McGuffin, L.J. and D.B. Roche. Bioinformatics, 2010. 26(2): p. 182-188.
12. Zhou, H. and J. Skolnick. Biophysical Journal, 2011. 101(8): p. 2043-2052.
13. Šali, A. and T.L. Blundell. J. Mol. Biol., 1993. 234(3): p. 779-815.
14. McGuffin, L.J., K. Bryson, and D.T. Jones. Bioinformatics, 2000. 16(4): p. 404-405.
15. Xu, D. and Y. Zhang. Biophysical Journal, 2011. 101(10): p. 2525-2534.
16. Garzon, J.I., et al. Bioinformatics, 2009. 25(19): p. 2544-2551.
17. Khosa, S., et al. Scientific reports, 2016. 6.
18. Zhang, Z., et al. Retrovirology, 2016. 13(1): p. 46.

Deepfold-Contact, Deepfold-Boom

Protein Contact Prediction with Deep Fully Convolutional Neural Network

Yang Liu, Qing Ye, Jian Peng


University of Illinois, Urbana Champaign
jianpeng@illinois.edu

See iFold_1, iFold_2, Deepfold-Contact, Deepfold-Boom, naïve.

DELCLAB

Protein 3D structure prediction by combining spectral and sequence homology of amino
acid sequences

Carlos A. Del Carpio Muñoz1,2


1-Protein Science Research Institute, Nakayama-cho 2-18, Mizuho, Nagoya, 467-0803, Japan, 2-Drosophila Genetic
Resource Center, Kyoto Institute of Technology, Saga Ippongi-cho, Ukyo-ku 616-8354, Japan.
cadrieldcm@yahoo.com

It is widely accepted in structural biology that protein structure is conserved through
evolution more strongly than sequence. This limits methods that predict the 3D structure of
proteins from sequence homology analysis alone. The limitation is particularly evident in the
so-called twilight zone of sequence homology (20-30% similarity), where predictions based only
on sequence homology are frequently of limited success.
3D structures for CASP12 targets were predicted using a new protocol proposed by the
author that combines orthodox homology methods with newly developed techniques. Partial
analysis of the results shows a significant improvement in the prediction process, particularly
for medium-difficulty targets.

METHOD
The author has previously proposed an original methodology to gauge protein 3D structural
similarities, at the heart of which is a spectral representation of amino acid sequences,
represented quantitatively by the values of their different physicochemical properties1; 2. This
has led to an automatic codification of folding patterns that can be used to retrieve patterns in a
classification tree such as the SCOP3 groups and families of proteins. The methodology recognizes
protein folding patterns by comparing the encoded target protein sequence with the database of
SCOP family sequences previously encoded following the proposed spectral protocol. This process
plays a pivotal role in homolog identification for sequences of low similarity.
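
As a rough illustration of what a spectral representation of a sequence can look like, the sketch below maps each residue to a physicochemical value and takes the power spectrum of the resulting numeric signal. It assumes a Fourier transform (in line with the title of reference 2) and uses the Kyte-Doolittle hydropathy scale as a stand-in property; the author's actual properties, encoding and comparison scheme are not given in this abstract.

```python
import numpy as np

# Kyte-Doolittle hydropathy values, used here only as an example property scale.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def property_power_spectrum(sequence, scale=KYTE_DOOLITTLE):
    """Return the power spectrum of a sequence encoded by one property scale."""
    signal = np.array([scale[aa] for aa in sequence], dtype=float)
    signal -= signal.mean()            # remove the constant (zero-frequency) offset
    return np.abs(np.fft.rfft(signal)) ** 2


# A strictly alternating hydrophobic/polar pattern has period 2,
# so its power concentrates in the highest-frequency bin.
spectrum = property_power_spectrum("LKLKLKLK")
```

Spectra of this kind can be compared between a target and previously encoded family members even when the residue-level sequence identity is low.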
In CASP12, we have combined this methodology with orthodox sequence-based
homology as well as information obtained by secondary structure prediction methods. This has
led to an improvement in the identification of folding patterns that were difficult to find
employing any single methodology at a time. Prediction of the secondary structure assists in the
assignment of the right sequence to any 3D piece of structure, particularly when the selection of
the protein is made by the spectral technique proposed by the author.
Here we discuss the effectiveness of our combined methodology when it is blindly
applied to predict the 3D structures of CASP12 targets.

RESULTS
A remarkable improvement in the assignment of protein folding patterns can be observed for the
CASP12 targets whose PDB structures have been released. Nevertheless, a full assessment of
the proposed technique may require a larger set of targets with experimental structures.

1. Del Carpio, C. A. & Carbajal, J. C. (2002). Folding pattern recognition in proteins using
spectral analysis methods. Genome Inform 13, 163-72.
2. Del Carpio, C. A. & Yoshimori, A. (2002). Fully automated protein tertiary structure
prediction using Fourier transform spectral methods. Protein Structure Prediction:
Bioinformatics, University of California, International University Line.
3. Murzin, A. G., Brenner, S. E., Hubbard, T. & Chothia, C. (1995). SCOP: a structural
classification of proteins database for the investigation of sequences and structures. J Mol
Biol 247, 536-40.

Distill

Distill for CASP12

B. Alshomrani, M. Torrisi, M. Kaleel, G. Pollastri


UCD Dublin, Ireland
gianluca.pollastri@ucd.ie

Distill has two main components: a fold recognition stage dependent on sets of protein features
predicted by machine learning techniques; an optimisation algorithm that searches the space of
protein backbones under the guidance of a potential based on templates found in the first stage.
Apart from updating the underlying databases, Distill is unchanged since 2012.

Methods
Distill runs 3 rounds of PSI-BLAST against a 90% redundancy reduced UniProt to generate
multiple sequence alignments (MSA). The PSSM from the second round is reloaded to search
the PDB for templates (e=1e-3). MSA and templates are fed to our 1D prediction systems (all
based on BRNN): Porter1,4,6 (secondary structure), PaleAle4,6 (solvent accessibility), BrownAle4
(contact density), Porter+2 (structural motifs). All predictors use template information as an input
alongside the sequence and MSA.
1D predictions are combined into a structural fingerprint4 (SAMD) which, alongside the
PSSM, is used to find remote homologues in the PDB through 6 searches (PSSM and SAMD
profile against PDB sequences and SAMD, with 3 different substitution matrices, plus 3 more
searches against PDB PSSM rather than sequences).
In the following stage residue contact maps are predicted by a system based on 2D-
Recursive Neural Networks (XXstout5). We predict binary maps with a contact threshold of 8Å
between Cβ, which are submitted to the RR category. Inputs for map prediction are: the
sequence; MSA; PSI-BLAST, SAMD and SAMD templates. That is, the maps are template-
based whenever suitable templates are found.
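
The contact definition used for these maps (Cβ atoms closer than 8 Å) can be made concrete with a short sketch. Note that it computes a binary map from known coordinates, for example those of a template or a reference structure; it is not the XXstout predictor itself. The function name and the strict "less than" comparison are assumptions.

```python
import numpy as np


def binary_contact_map(cb_coords, threshold=8.0):
    """Return an N x N boolean map, True where the Cβ-Cβ distance is below threshold (Å)."""
    coords = np.asarray(cb_coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    contacts = dist < threshold
    np.fill_diagonal(contacts, False)   # a residue is not in contact with itself
    return contacts


# Toy example: four pseudo-residues placed on a line, 5 Å apart.
toy_coords = [[0.0, 0.0, 0.0], [5.0, 0.0, 0.0], [10.0, 0.0, 0.0], [15.0, 0.0, 0.0]]
toy_map = binary_contact_map(toy_coords)
```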
The 3D reconstruction, which is only conducted on Cα traces, is run as follows: we run a
SAMD search for templates with an e-value of 10,000; for each (overlapping) 9-mer of the
protein we gather the structures of the top 50 templates which fully cover it (SAMD_list); a
simulated annealing search of the conformational space is run by substituting snippets of 3 to 9
amino acids extracted from the SAMD_list to quickly find a minimum of a potential function
which rewards formation of contacts that appear in a weighted average of the distance maps of
templates; from the previous endpoint a low temperature refinement is run by substituting 9-mers
from the conformation with 9-mers from the SAMD_list, and using the same potential function
as above.
We run 30 reconstructions for each protein, which we rank by their weighted TM-scores
against the template list. For the 5 top-ranked models we reconstruct the backbone with
SABBAC, and the full atoms with Scwrl4, then run a brief energy minimisation by gromacs.
These are the models submitted to CASP.
It should be noted that everything in our pipeline (except BLAST and the software to
blow Cα traces into full-atom models) is in house, and that in normal conditions we can provide
predictions for a protein in tens of minutes.

Results
We await the CASP assessment.

Availability
http://distillf.ucd.ie/distill/

1. Pollastri,G. & McLysaght,A. (2005) Porter, A new, accurate server for protein secondary
structure prediction, Bioinformatics, 21(8), 1719–1720.
2. Mooney,C., Vullo, A. & Pollastri, G.. (2006) Protein Structural Motif Prediction in
Multidimensional φ-ψ Space leads to improved Secondary Structure Prediction, Journal of
Computational Biology, 13(8), 1489-1502.
3. Walsh,I., Martin, A.J.M., Mooney, C., Rubagotti, E., Vullo, A. & Pollastri, G. (2009). Ab
initio and homology based prediction of protein domains by recursive neural networks, BMC
Bioinformatics, 10,195.
4. Mooney, C. & Pollastri, G. (2009). Beyond the Twilight Zone: Automated prediction of
structural properties of proteins by recursive neural networks and remote homology
information, Proteins, 77(1), 181-90.
5. Walsh, I., Baú, D., Martin, A.J.M., Mooney, C., Vullo, A. & Pollastri, G. (2009). Ab initio
and template-based prediction of multi-class distance maps by two-dimensional recursive
neural networks, BMC Structural Biology, 9,5.
6. Mirabello, C. & Pollastri, G. "Porter, PaleAle 4.0: high-accuracy prediction of protein
secondary structure and relative solvent accessibility", Bioinformatics, 29(16):2056-2058,
2013

EdaRose

Estimation of Distribution and Distance Restraints for Fragment-based Protein Structure
Prediction

D. Simoncini 1,3, A.R.D. Voet2 and K.Y. J. Zhang3


1 - LISBP, Université de Toulouse, CNRS, INRA, INSA, Toulouse, France, 2 - Laboratory of biomolecular modelling
and design, Dpt of Chemistry, University of Leuven, Celestijnenlaan 200G 3001 Heverlee, Belgium, 3 – Structural
Bioinformatics Team, Division of Structural and Synthetic Biology, Center for Life Science Technologies, RIKEN 1-
7-22 Suehiro, Yokohama, Kanagawa 230-0045, Japan
david.simoncini@insa-toulouse.fr

EdaRose is a Rosetta fragment-based protein structure prediction protocol1 which uses an
estimation of distribution algorithm (EDA) to bias the selection of fragments during model
assembly. At each iteration, EdaRose updates the probabilities of selecting fragments for future
model assembly after analyzing a specific subset of already produced models. EdaRose differs
from our previous EDA-based protein structure prediction software EdaFold2 in many ways.
While the selection of models for the estimation of distribution was solely based on the energy of
the models in EdaFold, EdaRose uses a cluster-based selection scheme derived from the protein
structure clustering software Durandal3. The rationale for this selection scheme is to promote
diversity in low energy models and prevent premature convergence of the algorithm. As a
Rosetta protocol, EdaRose inherits a lot of options from the Rosetta modeling suite. For the
CASP12 experiments, we used distance restraints based on contact predictions made with the
PconsC24 software.

Methods
Distance restraints.
Distance restraints were derived from contact predictions made with PconsC2. For each target,
the top N/10 contacts were kept (N being the length of the sequence) and converted into the
Rosetta constraints file format.
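
The selection of the top N/10 contacts can be sketched as follows. The tuple layout (i, j, score) and the function name are assumptions, and the conversion to the Rosetta constraints file format is deliberately left out.

```python
def top_contacts(predicted, seq_len, divisor=10):
    """Keep the seq_len // divisor highest-scoring predicted contacts.

    predicted: iterable of (i, j, score) tuples (hypothetical layout).
    """
    n_keep = seq_len // divisor
    ranked = sorted(predicted, key=lambda contact: contact[2], reverse=True)
    return ranked[:n_keep]


# Toy prediction for a 30-residue target: the three best-scoring pairs are kept.
toy_prediction = [(1, 20, 0.35), (3, 15, 0.91), (5, 28, 0.62), (2, 9, 0.10), (7, 22, 0.77)]
kept = top_contacts(toy_prediction, seq_len=30)
```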

Structure modeling.
For each target, between 24000 and 60000 models were produced using 100 CPU cores. Six
iterations of EdaRose were performed, each one producing 1/6th of the total number of models.
After each iteration, the models were clustered using a clustering radius of 4 Angstroms. The
Durandal protein clustering tool was modified to follow the procedure described below:
1. Select the lowest energy model as cluster center
2. Remove all models at less than 4 Angstrom from the cluster center
3. Iterate until all models have been removed
The top 100 lowest energy cluster centers were then analyzed to update the probability
distributions over the fragment library for subsequent iterations. For more details about the
estimation of distribution procedure, please refer to the paper describing EdaFold2. An option in
EdaRose allows the probability distributions to be reverted to their values at the previous
iteration when the algorithm detects that the energy of the produced models no longer improves.
This option was enabled during CASP12 experiments.
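
The three-step clustering procedure listed above can be sketched in a few lines. This is not the modified Durandal code: the function name is hypothetical, and the distance callback stands in for whatever structural distance (for example an RMSD in Angstroms) is used between two models.

```python
def greedy_energy_clustering(energies, distance, radius=4.0):
    """Return indices of cluster centers, lowest energy first.

    energies: list of model energies.
    distance: callable (a, b) -> structural distance between models a and b.
    """
    remaining = sorted(range(len(energies)), key=lambda idx: energies[idx])
    centers = []
    while remaining:
        center = remaining[0]            # step 1: lowest-energy remaining model
        centers.append(center)
        # step 2: drop the center and every model closer than `radius` to it
        remaining = [m for m in remaining[1:] if distance(center, m) >= radius]
    return centers                       # step 3: loop ends when no model is left


# Toy data: 1-D positions stand in for structures, |x_a - x_b| for their distance.
positions = [0.0, 1.0, 6.0, 7.5, 20.0]
toy_energies = [-10.0, -12.0, -8.0, -9.0, -5.0]
toy_centers = greedy_energy_clustering(
    toy_energies, lambda a, b: abs(positions[a] - positions[b])
)
```

Because centers are returned in order of increasing energy, the 100 lowest-energy cluster centers correspond to the first 100 entries of the returned list.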

Availability
EdaRose is available on demand as a patch for the Rosetta modeling software.

1. Leaver-Fay,A., Tyka,M., Lewis,S.M., Lange,O.F., Thompson,J., Jacak,R., Kaufman,K.W.,
Renfrew,P.D., Smith,C.A., Sheffler,W., Davis,I.W., Cooper,S., Treuille,A., Mandell,D.J.,
Richter,F., Ban,Y.A., Fleishman,S.J., Corn,J.E., Kim,D.E., Lyskov,S., Berrondo,M.,
Mentzer,S., Popović,Z., Havranek,J.J., Karanicolas,J., Das,R., Meiler,J., Kortemme,T.,
Gray,J.J., Kuhlman,B., Baker,D., Bradley,P. (2011) Rosetta3: An Object-Oriented Software
Suite for the Simulation and Design of Macromolecules, In: Michael L. Johnson and Ludwig
Brand, Editor(s), Methods in Enzymology, Academic Press, Volume 487, Pages 545-574.
2. Simoncini,D., Zhang,K.Y.J. (2013). Efficient sampling in fragment-based protein structure
prediction using an estimation of distribution algorithm. PLoS ONE, 8(7):e68954.
3. Berenger,F., Shrestha,R., Zhou,Y., Simoncini,D., Zhang,K.Y.J. (2011). Durandal: fast exact
clustering of protein decoys. Journal of computational chemistry 33 (4), 471-474.
4. Skwark,M. J., Abdel-Rehim,A., Elofsson, A. (2013). "PconsC: combination of direct
information methods and alignments improves contact prediction". Bioinformatics, 29(14),
1815-1816.

Elofsson

Manual model ranking based on agreement with contact maps

Mirco Michel, Nanjiang Shu, David Menéndez Hurtado and Arne Elofsson
Science for Life Laboratory and Department of Biochemistry and Biophysics, Stockholm University, Stockholm
10691, Sweden
arne@bioinfo.se

Agreement of a structural model with its predicted contact map is an indicator for having
identified the correct fold. Conversely, if the model disagrees with an accurate contact map, it
is likely that the folding process did not converge into the native state1. For the Elofsson
models we used a manual quality assessment to identify the best possible model among all server
models. Experience from earlier CASPs suggests that a method that could reliably identify the
best models would outperform the best individual method by 10-20%. For our quality analysis we
used a combination of single model quality assessments using ProQ32, consensus based quality
assessments using Pcons3 and agreement with contact predictions from PconsC34. The selection
for each target was done manually considering all these criteria.

Methods
All server models were refined and relaxed using the Rosetta relax protocol as in 2, before
calculating various QA scores (Pcons, ProQ3, ProQ3.diso, and Pcomb, see our QA abstract). Then
we compared each model against the contact map predicted by PconsC3 (see our RR abstract) and
calculated precision (PPV) of the top ranked contacts. We used two contact maps, the default one
based on an HHblits5 alignment and another based on a Jackhmmer6 alignment. This gave us a
variety of metrics which were organized in an internal website, such that it was easy to rank the
models by each score and to have a quick look upon the structure of each model along with
residue-wise local quality scores, see Figure 1. We then selected top ranked models based on all
scores and put more weight on PPV ranking if we trusted the predicted contact map to be accurate,
i.e. if the alignment was sufficiently large and if it showed clear and typical contact patterns.

Figure 1: Screenshot of our internal website to show and rank models based on QA and agreement
to predicted contact maps (measured in PPV) for an example target T0903. The top shows a part of
the contact map plots we used to assess their quality. The leftmost column shows an image of the
structure, scores by various methods in the following columns and a plot of local QA scores in
the rightmost column. In this case the contact map for the Jackhmmer alignment was missing.
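
The PPV of the top-ranked predicted contacts against a model can be sketched as below. The abstract does not state the contact definition or how many top contacts were evaluated, so the 8 Å Cβ-Cβ threshold, the top_n parameter and the (i, j, score) layout are assumptions.

```python
import numpy as np


def contact_ppv(predicted, model_cb_coords, top_n, threshold=8.0):
    """Fraction of the top_n predicted contacts that are in contact in the model.

    predicted: list of (i, j, score) with 0-based residue indices (assumed layout).
    model_cb_coords: per-residue Cβ coordinates of the model, shape (N, 3).
    """
    coords = np.asarray(model_cb_coords, dtype=float)
    ranked = sorted(predicted, key=lambda contact: contact[2], reverse=True)[:top_n]
    if not ranked:
        return 0.0
    hits = sum(
        bool(np.linalg.norm(coords[i] - coords[j]) < threshold) for i, j, _ in ranked
    )
    return hits / len(ranked)


# Toy model: four pseudo-residues on a line, 5 Å apart.
model = [[0.0, 0.0, 0.0], [5.0, 0.0, 0.0], [10.0, 0.0, 0.0], [15.0, 0.0, 0.0]]
prediction = [(0, 1, 0.9), (0, 3, 0.8), (1, 2, 0.7), (0, 2, 0.1)]
ppv = contact_ppv(prediction, model, top_n=3)
```

In the toy case two of the three top-ranked pairs are within 8 Å in the model, so the PPV is 2/3.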

Results
Figure 2 shows the performance of different predictors on 15 manual targets that are available in
PDB today. For each method it sums up TM-score of the models. The leftmost bar shows the best
possible outcome, i.e. selecting the model with highest TM-score for each target, resulting in a
sum of 9.42. Simply selecting the TS1 model from the BAKER-ROSETTASERVER achieves a score of
8.09 and thus highest performance of all predictors on this small subset of all targets. Our best
method so far is Pcomb with a TM-score sum of 7.81. The Elofsson predictions described in this
abstract achieve 7.72.
This means that based on this data we did not identify a QA method that outperforms all single
methods, nor did our manual selections improve the selection.

Figure 2: Sum of TM-scores for the respective model of all 15 manual targets that we could
identify in PDB at the time of writing this abstract. Blue bars: model selection by QA and PPV,
green: our prediction.

Availability
PconsC3 is available at
http://c3.pcons.net/. Pcons-net is available at http://pcons.net/.

1. Michel,M., Hayat,S., Skwark,M.J., Sander,C., Marks,D.S., Elofsson,A. (2014) PconsFold:
improved contact predictions improve protein models. Bioinformatics 30(17):i482-8
2. Uziela,K., Wallner,B., Elofsson,A. (2016) ProQ3: Improved model quality assessments using
Rosetta energy terms. ArXiv:1602.05832
3. Wallner,B., Elofsson,A. (2006). Identification of correct regions in protein models using
structural, alignment and consensus information. Protein Science 15(4):900-913.
4. Skwark,M.J., Michel,M., Hurtado,D.M., Ekeberg,M., Elofsson,A. (2016) Accurate contact
predictions for thousands of protein families using PconsC3. Submitted
5. Remmert,M., Biegert,A., Hauser,A., Söding,J. (2012) HHblits: lightning-fast iterative protein
sequence searching by HMM-HMM alignment. Nature Methods 9, 173-175
6. Finn,R.D., Clements,J., Eddy,S.R. (2011). HMMER web server: interactive sequence
similarity searching. Nucleic Acids Research 39 (suppl 2): W29-W37

FALCON_COLORS

Improving residue-residue contact prediction via low-rank and sparse decomposition of
residue correlation matrix

Haicang Zhang1, Yujuan Gao2, Minghua Deng2, Chao Wang1, Jianwei Zhu1, Qi Zhang1, Wei-
Mou Zheng3,* and Dongbo Bu1,*
1 - Institute of Computing Technology, Chinese Academy of Sciences, 2 - Peking University, Beijing, China, 3 -
Institute of Theoretical Physics, Chinese Academy of Sciences,
FALCON@ict.ac.cn

Strategies for correlation analysis in protein contact prediction often encounter two challenges,
namely, the indirect coupling among residues, and the background correlations mainly caused by
phylogenetic biases. While various studies have been conducted on how to disentangle indirect
coupling, the removal of background correlations still remains unresolved. Here, we present an
approach for removing background correlations via low-rank and sparse decomposition (LRS) of
a residue correlation matrix. The correlation matrix can be constructed using either local
inference strategies (e.g., mutual information, or MI) or global inference strategies (e.g., direct
coupling analysis, or DCA). In our approach, a correlation matrix was decomposed into two
components, i.e., a low-rank component representing background correlations, and a sparse
component representing true correlations. Finally, the residue contacts were inferred from the
sparse component of correlation matrix.

Methods
To apply the LRS technique for protein contact prediction, we first built a matrix to measure
correlations among residues in the target protein. The residue correlation measure can be
calculated by using local statistical models (e.g., MI and OMES) or global statistical models
(e.g., DCA and PSICOV). Next, by using the LRS technique, we decomposed the residue
correlation matrix into a low-rank component plus a sparse component. The sparse component
was then used to infer residue-residue contacts in the target protein.

1. Residue correlation matrix construction


The correlation measures are usually derived from MSA information by using local statistical
models or global statistical models. The correlation matrices reported by PSICOV1, mfDCA2,
and plmDCA3,4 were used as representatives of global statistical models. As for local statistical
models, we focused on the widely-used MI and OMES correlation measures. In addition, we
designed another correlation measure, called COV, based on empirical covariance.
Similar to OMES, COV measures the difference between observed versus expected
frequency of residue pairs; however, COV calculates the L1 norm of the frequency difference.
Note that in this study the MSAs were re-weighted before residue correlations were calculated:
although re-weighting was designed to mitigate potential redundancy in the MSA, it has also been
reported to improve contact prediction precision.
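
One way to read the COV measure is as the L1 norm, over all residue-type pairs (a, b), of the difference between the observed pair frequency f_ij(a, b) and the expected frequency f_i(a) f_j(b). The sketch below follows that reading. The integer encoding of the MSA, the absence of pseudocounts and the function name are assumptions, and sequence weights are taken as given because the re-weighting scheme is not specified here.

```python
import numpy as np


def cov_matrix(msa, weights=None, n_states=21):
    """L1 norm of f_ij(a, b) - f_i(a) f_j(b) for every pair of MSA columns.

    msa: (n_seqs, n_cols) integer array with residue codes in [0, n_states).
    weights: optional per-sequence weights (uniform if omitted).
    """
    msa = np.asarray(msa)
    n_seqs, _ = msa.shape
    w = np.ones(n_seqs) if weights is None else np.asarray(weights, dtype=float)
    w = w / w.sum()
    # onehot[s, i, a] = 1 if sequence s has residue a in column i.
    onehot = np.eye(n_states)[msa]
    f_single = np.einsum("s,sia->ia", w, onehot)
    f_pair = np.einsum("s,sia,sjb->ijab", w, onehot, onehot)
    expected = np.einsum("ia,jb->ijab", f_single, f_single)
    cov = np.abs(f_pair - expected).sum(axis=(2, 3))
    np.fill_diagonal(cov, 0.0)
    return cov


# Toy MSA: columns 0 and 1 co-vary perfectly, column 2 is fully conserved.
toy_msa = [[0, 0, 2], [1, 1, 2], [0, 0, 2], [1, 1, 2]]
toy_cov = cov_matrix(toy_msa, n_states=3)
```

In the toy alignment the co-varying pair receives a score of 1.0, while any pair involving the conserved column scores 0.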

2. Low-rank and sparse matrix decomposition

Suppose we have already calculated a residue correlation matrix M, where the entry $M_{ij}$
quantifies the correlation between residues at the i-th and j-th sites using one of the measures
mentioned above. We decomposed the matrix M into two components, i.e., M = L + S, where L is
a low-rank matrix describing background correlations, and S is a sparse matrix describing true
correlations. Thus, each non-zero entry of S, for instance $S_{ij}$, represents the propensity of
a contact between the i-th and j-th residues.
The optimal decomposition can be determined via solving the following optimization
problem:

$$\min_{L,\,S}\ \|L\|_{*} + \lambda \|S\|_{1} \quad \text{subject to} \quad L + S = M.$$

The objective function consists of two terms. The first term measures the rank of L using its
nuclear norm, i.e., the sum of its singular values; generally, the smaller the nuclear norm,
the lower the rank of L. The second term measures the sparsity of S using its L1 norm. In the
study, the objective function was minimized by means of the inexact augmented Lagrange multiplier
technique, and the parameter λ was introduced to balance the two terms. In particular, the
sparsity of S was emphasized by setting a large λ, while the low rank of L was emphasized by
setting a small λ. The optimal setting of λ was obtained based on a training dataset.
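
A minimal sketch of an inexact augmented Lagrange multiplier solver for this problem is given below. It follows the generic scheme from the robust PCA literature (singular value thresholding for L, entrywise soft thresholding for S) and is not the authors' implementation; the default λ = 1/sqrt(n), the initialisation and the step parameters are common choices taken as assumptions, whereas in the study λ was tuned on a training set.

```python
import numpy as np


def lrs_decompose(M, lam=None, tol=1e-7, max_iter=500, rho=1.5):
    """Split M into a low-rank part L and a sparse part S with L + S ≈ M."""
    M = np.asarray(M, dtype=float)
    if lam is None:
        lam = 1.0 / np.sqrt(max(M.shape))          # assumed default, not from the abstract
    norm_fro = np.linalg.norm(M, "fro")
    norm_two = np.linalg.norm(M, 2)
    Y = M / max(norm_two, np.abs(M).max() / lam)   # Lagrange multiplier initialisation
    mu = 1.25 / norm_two                           # penalty parameter
    L = np.zeros_like(M)
    S = np.zeros_like(M)
    for _ in range(max_iter):
        # L-step: singular value thresholding of M - S + Y/mu
        U, sig, Vt = np.linalg.svd(M - S + Y / mu, full_matrices=False)
        L = (U * np.maximum(sig - 1.0 / mu, 0.0)) @ Vt
        # S-step: entrywise soft thresholding of M - L + Y/mu
        R = M - L + Y / mu
        S = np.sign(R) * np.maximum(np.abs(R) - lam / mu, 0.0)
        residual = M - L - S
        Y = Y + mu * residual                      # multiplier update
        mu *= rho                                  # tighten the penalty
        if np.linalg.norm(residual, "fro") <= tol * norm_fro:
            break
    return L, S


# Toy input: a rank-2 "background" plus a few large isolated entries.
rng = np.random.default_rng(0)
n = 40
background = rng.standard_normal((n, 2)) @ rng.standard_normal((2, n))
spikes = np.zeros((n, n))
spikes.flat[rng.choice(n * n, size=40, replace=False)] = 5.0
M_toy = background + spikes
L_toy, S_toy = lrs_decompose(M_toy)
```

In the contact prediction setting, M would be the residue correlation matrix and the entries of S would be ranked to produce the contact list.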

Results
We trained our LRS-based method on the PSICOV dataset, and tested it on both GREMLIN and
CASP11 datasets. Our experimental results suggested that LRS significantly improves the
contact prediction precision. For example, when equipped with the LRS technique, the prediction
precision of MI and mfDCA increased from 0.25 to 0.67 and from 0.58 to 0.70, respectively (Top
L/10 predicted contacts, sequence separation: 5 AA, dataset: GREMLIN). In addition, our LRS
technique also consistently outperforms the popular denoising technique APC (average product
correction), on both local (MI_LRS: 0.67 vs MI_APC: 0.34) and global measures
(mfDCA_LRS: 0.70 vs mfDCA_APC: 0.67). Interestingly, we found out that when equipped
with our LRS technique, local inference strategies performed in a comparable manner to that of
global inference strategies, implying that the application of LRS technique narrowed down the
performance gap between local and global inference strategies. Overall, our LRS technique
greatly facilitates protein contact prediction by removing background correlations.
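
For reference, the APC baseline mentioned above subtracts from each entry the product of its row and column means divided by the overall mean. The sketch below uses means over off-diagonal entries, which is a common convention and an assumption here.

```python
import numpy as np


def apply_apc(M):
    """Average product correction: M_ij - mean_i * mean_j / mean_all."""
    M = np.asarray(M, dtype=float)
    n = M.shape[0]
    off = M.copy()
    np.fill_diagonal(off, 0.0)                    # ignore self-correlations
    row_mean = off.sum(axis=1) / (n - 1)
    total_mean = off.sum() / (n * (n - 1))
    corrected = off - np.outer(row_mean, row_mean) / total_mean
    np.fill_diagonal(corrected, 0.0)
    return corrected


# A uniform background carries no pair-specific signal and is removed entirely.
flat = apply_apc(np.ones((4, 4)))
```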

Availability
http://protein.ict.ac.cn/COLORS5

1. Jones, D. T., Buchan, D. W., Cozzetto, D., & Pontil, M. (2012). PSICOV: precise structural
contact prediction using sparse inverse covariance estimation on large multiple sequence
alignments. Bioinformatics 28(2): 184-190.
2. Morcos, F., Pagnani, A., Lunt, B., Bertolino, A., Marks, D. S., Sander, C., ... & Weigt, M.
(2011). Direct-coupling analysis of residue coevolution captures native contacts across many
protein families. Proceedings of the National Academy of Sciences 108(49): E1293-E1301.

3. Ekeberg, M., Lövkvist, C., Lan, Y., Weigt, M., & Aurell, E. (2013). Improved contact
prediction in proteins: using pseudolikelihoods to infer Potts models. Physical Review E
87(1): 012707.
4. Ekeberg, M., Hartonen, T., & Aurell, E. (2014). Fast pseudo-likelihood maximization for
direct-coupling analysis of protein structure from many homologous amino-acid sequences.
Journal of Computational Physics 276: 341-356.
5. Skwark, M. J., Raimondi, D., Michel, M., & Elofsson, A. (2014). Improved contact
predictions using the recognition of protein like contact patterns. PLoS Comput. Biol. 10(11):
e1003889.

FALCON_TOPO

Improving Protein Threading Accuracy via Combining Local and Global Potential Using
TreeCRF Model

Haicang Zhang1, Chao Wang1, Jianwei Zhu1, Qi Zhang1, Wei-Mou Zheng2 and Dongbo Bu1,*
1 - Institute of Computing Technology, Chinese Academy of Sciences, 2 - Institute of Theoretical Physics, Chinese
Academy of Sciences
FALCON@ict.ac.cn

Threading methods usually identify the most likely fold of a target protein by building sequence-
structure alignments between this protein and template proteins. These alignments are calculated
based on local information of each residue in target and template proteins as well as global
interactions among residues. However, how to incorporate long-distance interaction into
threading leads to a dilemma: on the one hand, considering all residue-residue interactions
makes alignment computationally intractable; on the other hand, ignoring residue-residue
interactions reduces alignment accuracy. In this study, we make a compromise using a
statistical model called TreeCRF (Tree Conditional Random Fields). Specifically, we select a set
of nested residue-residue contacts by removing less important contacts, and then build
alignments that take these nested contacts into consideration, using an efficient dynamic
programming algorithm called Tree-Viterbi. This way, our approach achieves both high accuracy
and high efficiency. We implemented the TreeCRF model in TreeThreader. Experimental
results on benchmark datasets show that the accuracy of the alignments built by TreeThreader is
higher than that of HHpred.

Methods
Given a sequence of a target protein, TreeThreader predicts its tertiary structure as follows.

1. Select the most informative contact pairs of the template using dynamic programming.
For each template protein, we first calculate all residue-residue contacts and measure the
importance of each contact using a metric called contact potential. Next, we extract nested
contacts by removing less important contacts. The optimal nested contacts are calculated using
the dynamic programming technique.
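
One plausible reading of this step is a maximum-weight selection of mutually non-crossing contacts by interval dynamic programming, sketched below. The abstract does not define "nested" precisely or give the recurrence, so the non-crossing interpretation (two contacts (i, j) and (k, l) conflict when i < k < j < l), the function name and the toy potentials are assumptions.

```python
from functools import lru_cache


def select_nested_contacts(n_residues, potentials):
    """Pick the non-crossing contact subset with the largest total potential.

    potentials: {(i, j): weight} with i < j (0-based residue indices).
    Returns (total_weight, sorted list of kept contacts).
    """

    @lru_cache(maxsize=None)
    def best(i, j):
        if j <= i:
            return 0.0, ()
        # Option 1: no kept contact starts at residue i.
        score, chosen = best(i + 1, j)
        # Option 2: the longest kept contact from i ends at k, which splits the interval.
        for k in range(i + 1, j):
            left, right = best(i, k), best(k, j)
            if left[0] + right[0] > score:
                score, chosen = left[0] + right[0], left[1] + right[1]
        # The contact (i, j) spans the whole interval and never crosses anything inside it.
        weight = potentials.get((i, j), 0.0)
        if weight > 0:
            score, chosen = score + weight, chosen + ((i, j),)
        return score, chosen

    total, chosen = best(0, n_residues - 1)
    return total, sorted(chosen)


# (0, 5) and (2, 7) cross, as do (1, 4) and (2, 7); (1, 4) is nested inside (0, 5).
toy_potentials = {(0, 5): 3.0, (2, 7): 2.0, (1, 4): 1.0}
total_weight, kept_contacts = select_nested_contacts(8, toy_potentials)
```

This sketch runs in cubic time and recurses to a depth of about the sequence length, which is fine for illustration but would need an iterative table for long templates.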

2. Align target protein with template protein using TreeCRF model


To align a target protein against template proteins with nested contacts, we proposed Tree-
Viterbi, a dynamic programming algorithm specially designed for the nested graph formed by these
contacts. Briefly, the Viterbi algorithm for sequences employs backward/forward procedures.
Similarly, we designed inside/outside procedures for the nested graph. Our Tree-Viterbi algorithm
calculates the optimal alignment in O(Kmn) time, where K denotes the number of selected
contact pairs of the template, and m and n denote the lengths of the template and the target
protein, respectively.

3. Build tertiary structures and model selection

The generated alignments were fed into MODELLER to build the tertiary structure of the target
protein. The predicted models were ranked according to the dDFIRE2 energy function. For free-
modeling targets, we ran FALCON3 to generate several models and selected the best ones using
the ROSETTA4 energy function.

Results
We evaluated our approach on two benchmark datasets, namely, a PDB25 dataset consisting of
600 proteins, and a NEW70 dataset consisting of 1263 proteins. All proteins in the PDB25
dataset were annotated with reference alignments calculated using TMalign. These reference
alignments were used to evaluate the generated alignments. We randomly selected 300 proteins
as the training set to train our TreeCRF model and used the remaining 300 proteins as the
testing set. On this testing set, TreeThreader generates alignments with higher accuracy than
HHpred (GDT: 0.36 vs. 0.33).

The NEW70 dataset consists of 1263 proteins with their native structures released and deposited
into PDB after the construction of template database used in this study. This way, there is no
overlap between the NEW70 dataset and the template database. For these proteins, we ran
TreeThreader and HHpred to predict tertiary structures. Experimental results show that
TreeThreader exhibits an average GDT_TS score of 0.68, which is higher than that of HHpred (0.66).

In summary, our TreeThreader approach shows high accuracy in building alignment by
considering long-distance contacts. In addition, our approach also shows high efficiency by
extracting the nested contacts and employing an efficient Tree-Viterbi algorithm to calculate the
optimal alignment.

Availability
TreeThreader was incorporated into FALCON@home, a volunteer platform with over 20,000
volunteer CPUs from all over the world. The server is available through
http://protein.ict.ac.cn/FALCON5

1. Söding, J. (2005). Protein homology detection by HMM-HMM comparison. Bioinformatics
21 (7): 951–960.
2. Yang, Y., & Zhou, Y. (2008). Specific interactions for ab initio folding of protein terminal
regions with secondary structures. Proteins: Structure, Function, and Bioinformatics 72(2):
793-803.
3. Li, S. C., Bu, D., Xu, J., & Li, M. (2008). Fragment‐HMM: A new approach to protein
structure prediction. Protein Science 17(11), 1925-1934
4. Simons, K. T., Kooperberg, C., Huang, E., & Baker, D. (1997). Assembly of protein tertiary
structures from fragments with similar local sequences using simulated annealing and
Bayesian scoring functions. J. Mol. Biol. 268(1), 209-225.
5. Wang, C., Zhang, H., Zheng, W. M., Xu, D., Zhu, J., Wang, B., Ning, K., Sun, S., Li, S. C. &
Bu, D. (2016). FALCON@ home: a high-throughput protein structure prediction server based
on remote homologue recognition. Bioinformatics 32(3): 462-464.

Faraggi

Eshel Faraggi 1, Andrzej Kloczkowski 2


1- IUPUI, 2- OSU, NCH

We have participated in the Critical Assessment of Protein Structure Prediction (CASP)
experiment with four prediction procedures. The procedure described in this abstract is labeled as
group "Faraggi", number 363. This method is based on a new version of the Seder program [1,2],
with new and improved input features as will be described in an upcoming manuscript. For this
procedure Seder was trained with soft and hard protein targets. We use CASP5 through CASP10
server models for training data, and CASP11 server models as a test set. That is, in this case we
train over all CASP targets and optimize the prediction on both hard and soft CASP11 targets.
This version of Seder is then used to pick among all CASP12 submitted server models. To
estimate the B-factors for the protein models, we used the following equation: B-factor = 300 *
SPXASA / (1 + model-residue-depth), where SPXASA is the SPINE-X [3] predicted accessible
surface area and model-residue-depth is the residue depth reported by the program DEPTH
[4]. This model came from approximately fitting a distribution of experimental B-factors.
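
The B-factor estimate above can be written as a one-line function. The following is a minimal sketch only: the function name and the sample values are hypothetical, and the abstract does not state the units of SPXASA or of the residue depth, so none are assumed here.

```python
def estimate_b_factor(spx_asa, residue_depth):
    """B-factor = 300 * SPXASA / (1 + model-residue-depth), as stated in the text."""
    if residue_depth < 0:
        raise ValueError("residue depth must be non-negative")
    return 300.0 * spx_asa / (1.0 + residue_depth)


# Hypothetical per-residue inputs: (predicted accessible surface area, residue depth).
residues = [(0.8, 3.0), (0.1, 9.0), (0.5, 4.0)]
b_factors = [estimate_b_factor(asa, depth) for asa, depth in residues]
```

With this form, exposed residues near the surface (large SPXASA, small depth) receive larger estimated B-factors than buried ones.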

1. Faraggi, Eshel, and Andrzej Kloczkowski. "A global machine learning based scoring function
for protein structure prediction." Proteins: Structure, Function, and Bioinformatics 82.5
(2014): 752-759.
2. Manuscript in preparation.
3. Faraggi, Eshel, et al. "SPINE X: improving protein secondary structure prediction by multistep
learning coupled with prediction of solvent accessible surface area and backbone torsion
angles." Journal of computational chemistry 33.3 (2012): 259-267.
4. Tan, Kuan Pern, Raghavan Varadarajan, and Mallur S. Madhusudhan. "DEPTH: a web server
to compute depth and predict small-molecule binding cavities in proteins." Nucleic acids
research 39.suppl 2 (2011): W242-W248.

FDUBio

Sorting protein decoy models by learning-to-rank

Xiaoyang Jing1, Kai Wang2, Ruqian Lu1 and Qiwen Dong3


1-School of Computer Science, Fudan University, Shanghai 200433, People's Republic of China, 2-College of
Animal Science and Technology, Jilin Agricultural University, Changchun 130118, People's Republic of China, 3-
Institute for Data Science and Engineering, East China Normal University, Shanghai 200062, People's Republic of
China
qwdong@fudan.edu.cn

Protein model quality assessment is an important open problem in protein structure prediction.
We developed a novel quasi single-model method for protein model quality assessment based on a
learning-to-rank algorithm1. The method first sorts the decoy models by their predicted quality
relative to the native structure, then takes the first five decoy models as reference models and
predicts the quality of each other model as its average GDT_TS score against the reference
models.

Methods
The proposed method formulates the protein model quality assessment task as a ranking task,
and the protein decoy models are ranked by their similarities with the corresponding native
structure. Such similarities can be measured by various structure comparison methods and the
GDT_TS score2 is adopted here. Two kinds of features are extracted from the decoy models:
well-established knowledge-based mean force potentials and the evaluation scores of other state-
of-the-art programs for model quality assessment of proteins. The knowledge-based potentials
used in the proposed method include Boltzmann-based potentials3, the DFIRE potential4, the
DOPE potential5, the GOAP potential6 and the RWplus potential7. The evaluation scores from
other model quality assessment programs are also extracted as additional features, which include
the Frst8, ProQ9, RFMQA10, SIFT11 and SELECTpro12 software. Each decoy model is
represented as a feature vector in which the elements are taken from the features. For the
learning-to-rank algorithm, the pairwise ranking-via-classification approach is adopted for
model quality assessment, and the SVMrank program13 is used as the implementation. A linear
kernel is used, and the parameters are optimized with five-fold cross-validation on the
3DRobot dataset14.
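
The idea behind pairwise ranking via classification can be illustrated with a short sketch. This is not the SVMrank code; it only shows, under our own assumptions, how the decoys of one target could be turned into labeled pairs for a binary classifier. The function name and the toy inputs are hypothetical.

```python
from itertools import combinations


def pairwise_examples(features, gdt_ts):
    """Turn one target's decoys into (feature difference, label) training pairs.

    The label is +1 when the first decoy of the pair has the higher GDT_TS and
    -1 when it has the lower one. Ties carry no ranking information and are skipped.
    """
    examples = []
    for i, j in combinations(range(len(features)), 2):
        if gdt_ts[i] == gdt_ts[j]:
            continue
        diff = [a - b for a, b in zip(features[i], features[j])]
        label = 1 if gdt_ts[i] > gdt_ts[j] else -1
        examples.append((diff, label))
    return examples


# Three hypothetical decoys with two features each and their GDT_TS values.
toy_features = [[1.0, 2.0], [0.5, 1.0], [0.0, 3.0]]
toy_gdt_ts = [0.7, 0.5, 0.5]
toy_pairs = pairwise_examples(toy_features, toy_gdt_ts)
```

A linear classifier trained on such difference vectors yields a weight vector whose dot product with a decoy's features can be used as a ranking score.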
For a given protein, all of its decoy models are first represented as feature vectors, which
serve as instances. These instances are then passed to the learning-to-rank algorithm to
predict the relative ranking of any two decoys from the same protein. Finally, the
proposed quasi single-model method takes the first five decoy models ranked by the
learning-to-rank algorithm as the reference models, and the predicted quality of each other
model is its average GDT_TS score against the five reference models.
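
The final scoring step can be sketched as follows. This is a minimal illustration, assuming a ranked list of model identifiers and a caller-supplied similarity function that returns a GDT_TS-like score; both names are hypothetical. The abstract does not say how the reference models themselves are scored, so the sketch scores only the remaining models.

```python
def quasi_single_model_scores(ranked_ids, similarity, n_ref=5):
    """Score each non-reference model by its mean similarity to the reference models.

    ranked_ids: model identifiers ordered best-first by the ranking step.
    similarity: callable (model_id, reference_id) -> GDT_TS-like score.
    """
    references = ranked_ids[:n_ref]
    return {
        model: sum(similarity(model, ref) for ref in references) / len(references)
        for model in ranked_ids[n_ref:]
    }


# Toy example with two reference models and a lookup table of pairwise scores.
toy_table = {("c", "a"): 0.6, ("c", "b"): 0.4, ("d", "a"): 0.2, ("d", "b"): 0.4}
toy_scores = quasi_single_model_scores(
    ["a", "b", "c", "d"], lambda m, r: toy_table[(m, r)], n_ref=2
)
```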

Availability
The package of the proposed method is freely available at
http://www.iipl.fudan.edu.cn/staff/dongqw/MQAPRank.html.

1. Jing X, Wang K, Lu R, Dong Q. (2016). Sorting protein decoys by machine-learning-to-rank.
Scientific Reports. 6, 31571.
2. Moult J, Fidelis K, Kryshtafovych A, Schwede T, Tramontano A. (2014). Critical assessment
of methods of protein structure prediction (CASP)—round x. Proteins: Structure, Function,
and Bioinformatics. 82, 1-6.
3. Dong Q, Zhou S. (2011). Novel Nonlinear Knowledge-Based Mean Force Potentials
Based on Machine Learning. IEEE/ACM Transactions on Computational Biology and
Bioinformatics. 8, 476-86.
4. Zhou H, Zhou Y. (2002). Distance‐scaled, finite ideal‐gas reference state improves
structure‐derived potentials of mean force for structure selection and stability prediction.
Protein Science. 11, 2714-26.
5. Eswar N, Webb B, Marti-Renom MA, Madhusudhan MS, Eramian D, Shen MY, et al. (2006).
Comparative Protein Structure Modeling Using Modeller. chapter 2, 5.6.1-5.6.30.
6. Zhou H, Skolnick J. (2011). GOAP: A Generalized Orientation-Dependent, All-Atom
Statistical Potential for Protein Structure Prediction. Biophysical Journal. 101, 2043-52.
7. Zhang J, Zhang Y. (2010). A novel side-chain orientation dependent potential derived from
random-walk reference state for protein fold selection and structure prediction. PloS one. 5,
e15386.
8. Tosatto SCE. (2005). The victor/FRST function for model quality estimation. Journal of
Computational Biology. 12, 1316.
9. Wallner B, Elofsson A. (2003). Can correct protein models be identified? Protein Science. 12,
1073-86.
10. Manavalan B, Lee J, Lee J. (2014). Random Forest-Based Protein Model Quality Assessment
(RFMQA) Using Structural Features and Potential Energy Terms. PLoS ONE. 9, e106542.
11. Adamczak R, Meller J. (2004). On the transferability of folding and threading potentials and
sequence-independent filters for protein folding simulations. Mol Phys. 102, 1291-305.
12. Randall A, Baldi P. (2008). SELECTpro: effective protein model selection using a structure-
based energy function resistant to BLUNDERs. BMC Structural Biology.
8, 52.
13. Joachims T. (2006). Training linear SVMs in linear time. Proceedings of the 12th ACM
SIGKDD international conference on Knowledge discovery and data mining. p.217-26.
14. Deng H, Jia Y, Zhang Y. (2015). 3DRobot: automated generation of diverse and well-packed
protein structure decoys. Bioinformatics. btv601.

FEIG

(loc)PREFMD: Protein Structure Refinement via Molecular Dynamics Simulations

Michael Feig
Department of Biochemistry & Molecular Biology, Michigan State University, East Lansing, MI, USA
feig@msu.edu

During CASP12, we followed up on our previous successes with molecular dynamics (MD)
centered protein structure refinement. The major challenges towards improved refinement
success in CASP10 and CASP11 were the use of restraints that limited more extensive
refinement, especially in loop regions, and a relatively poor local stereochemical quality after
generating models via ensemble averaging. Our major focus during CASP12 was to try more
aggressive protocol variants to explore how to achieve more extensive refinement without
compromising reliability.
Generally, our protocol consisted of a series of extensive MD simulations of fully solvated
protein structures in explicit solvent, carried out with the CHARMM force field. During
CASP12 we expanded the lengths of the simulations to 2-8 μs per target depending on target
size and resource availability (vs. about 1 μs/target during CASP11). We used weaker
harmonic restraints compared to CASP11 to allow broader exploration of conformational space
while still preventing complete departure from the initial models. We also used a new version
of the CHARMM force field1 that is meant to improve the sampling of non-canonical structural
regions with the idea that sampling of loops would be improved. After generating the initial
ensemble we used our established protocol of subset ensemble averaging based on the RW+
scoring function2 and distance from the initial model to obtain a refined model3. Predictions
resulting from this protocol were submitted as Model 3. To explore whether we can extend
refinement we added a second round of refinement where we ran additional simulations over
100 ns each from clusters of dominant states identified via Markov State Modeling. The
ensemble of states obtained from the second round was subset averaged again resulting in
Model 2 submissions. The initial ensemble was combined with the additional sampling started
from the cluster centers to generate Model 1 submissions, our best guess for the most successful
overall refinement based on initial tests. We furthermore tested a different idea for
extending refinement, based on the assumption that the initially refined models are likely
improved in the right direction and that structures that follow this direction further may be
refined further overall. This was tested by generating an additional ensemble, without any
restraints, starting from the refined model from round 1; this ensemble was then filtered
according to the vectorial direction given by the initial and refined models (Model 4 submissions).
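
One plausible reading of the direction-based filter is a projection test, sketched below. The abstract does not give the actual criterion, so this is an assumption: each structure is a flat list of coordinates already superimposed in a common frame, and a candidate is kept when its displacement from the refined model has a positive component along the initial-to-refined direction. All function names and the threshold are hypothetical.

```python
import math


def progress_along_direction(initial, refined, candidate):
    """Signed distance of `candidate` past `refined` along the initial->refined direction."""
    direction = [r - i for r, i in zip(refined, initial)]
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0.0:
        raise ValueError("initial and refined models are identical")
    step = [c - r for c, r in zip(candidate, refined)]
    return sum(s * d for s, d in zip(step, direction)) / norm


def filter_by_direction(initial, refined, ensemble, min_progress=0.0):
    """Keep ensemble members that move further along the refinement direction."""
    return [
        c for c in ensemble
        if progress_along_direction(initial, refined, c) > min_progress
    ]


# Toy two-coordinate example: only the first candidate continues past the refined model.
kept = filter_by_direction([0.0, 0.0], [1.0, 0.0], [[2.0, 0.0], [0.5, 0.0]])
```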
We also applied a new local refinement protocol (locPREFMD)4 that specifically targets
problems with local stereochemistry. This protocol was applied to all predictions before
submission but we also submitted just the application of locPREFMD to the CASP-given
structures to test its performance separately.

1. Huang, J., Rauscher, S., Nawrocki, G., Ran, T., Feig, M., de Groot, B. L., Grubmüller, H.
& MacKerell Jr., A. D. CHARMM36m: An Improved Force Field for Folded and
Intrinsically Disordered Proteins. under review (2016).

2. Zhang, Y. & Skolnick, J. Scoring Function for Automated Assessment of Protein Structure
Template Quality. Proteins 68, 1020-1020 (2007).
3. Feig, M. & Mirjalili, V. Protein Structure Refinement via Molecular-Dynamics
Simulations: What Works and What Does Not? Proteins, doi:10.1002/prot.24871 (2015).
4. Feig, M. Local Protein Structure Refinement via Molecular Dynamics Simulation with
locPREFMD. Journal of Chemical Information and Modeling 56, 1304-1312 (2016).

FLOUDAS

Constrained grey box global optimization for protein structure prediction

C.A. Kieslich1,2,*, U. Shah1,2, M. Onel1,2, J.Souvaliotis1,2, C.A. Floudas1,2,+


1-Artie McFerrin Department of Chemical Engineering, Texas A&M University, 2-Texas A&M Energy Institute,
Texas A&M University; * Current affiliation: Coulter Department of Biomedical Engineering, Georgia Institute of
Technology; +deceased
kieslich@gatech.edu

It has been known for some time in the area of model quality assessment (QA) that, when
considering an ensemble of structural models, mean similarity to all other models is highly
correlated with model accuracy. For evaluation of QA methods in CASP10, mean-pairwise GDT
(mpwGDT) was used as a naïve standard of performance. Despite its surprisingly high accuracy,
mpwGDT has no explicit mathematical form, making it impossible to incorporate mpwGDT
into traditional approaches for conformational sampling. However, mpwGDT can
be used in grey box optimization, where a surrogate model is fit to a limited set of known
samples and then the surrogate model is minimized/maximized. Grey box optimization methods
are specifically intended for applications where the mathematical form of the problem at hand is
too complex or unknown. To this end, we have developed and applied a method based on grey
box optimization with ARGONAUT1 for sampling protein conformations according to the
mpwGDT landscape, rather than a traditional potential energy function.
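
For illustration, a mean-pairwise score can be computed from a precomputed similarity matrix as in the sketch below. The function name is hypothetical and the matrix is assumed to hold GDT_TS-like values between every pair of models; how the pairwise values are obtained is outside the sketch.

```python
def mean_pairwise_scores(pairwise):
    """Return, for each model, its mean similarity to all other models.

    pairwise[i][j] is assumed to be a GDT_TS-like similarity between models i and j.
    """
    n = len(pairwise)
    if n < 2:
        raise ValueError("need at least two models")
    return [
        sum(pairwise[i][j] for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]


# Toy ensemble of three models.
toy_matrix = [
    [1.0, 0.8, 0.4],
    [0.8, 1.0, 0.6],
    [0.4, 0.6, 1.0],
]
toy_mpw = mean_pairwise_scores(toy_matrix)
best_model = max(range(len(toy_mpw)), key=toy_mpw.__getitem__)
```

Because the score of one model depends on every other model in the ensemble, it has no closed form in the coordinates of a single structure, which is the property that motivates the surrogate-based treatment.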

Methods
We have developed a platform for protein structure prediction that integrates sequence- and
structural template-based features and utilizes constrained grey-box global optimization through
ARGONAUT to traverse conformational space. The approach is hierarchical in nature, and
includes improved methods for the prediction of secondary structure, tertiary contacts, beta-sheet
topology, and tertiary structure prediction. Secondary structure prediction is based on our SVM-
based consensus method, conSSert2, which has been optimized for accurate prediction of helices
and strands. Secondary structure predictions from conSSert are used to search a structural
template library, using a modified implementation of HHsuite3. The prediction of tertiary
contacts is based on the integration of a random forest model that utilizes coevolution scores and
sequence-based features, and tertiary contacts extracted from the structural templates identified
by conSSert/HHsuite. The β-sheet topology method takes as input secondary structure and
tertiary contact predictions and utilizes two MILP models: (i) a MILP model for strand pair
alignment, and (ii) a MILP model for β-sheet topology that explicitly accounts for hydrogen-
bond patterns. For ab initio tertiary structure prediction, initial models are first generated based
on several subsets of the predicted tertiary contacts, including β-sheet topology contacts.
Predicted contacts are used as distance restraints, which serve as input into CYANA4 for the
generation of an initial model. Each initial CYANA model is refined using Rosetta5 FastRelax,
and mpwGDT is computed for all decoys using TMscore6. Subsequently, ARGONAUT is used
to fit and optimize surrogate models to minimize a structure-based objective function according
to the pairwise distances of predicted residue contacts. Constraints are used to exclude regions of
the conformational space where pairwise distances are conflicting. Solutions of the surrogate
models are added to the set of predicted structures and used in the next iterations of model
fitting and optimization.
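
The fit, optimize and resample loop can be illustrated with a one-dimensional toy. This is not ARGONAUT: the surrogate here is simply a parabola through the three best samples, the only constraints are box bounds, and the function name and parameters are hypothetical.

```python
def surrogate_minimize(objective, samples, lower, upper, n_iter=10, tol=1e-6):
    """Minimize a 1D black-box objective with a quadratic surrogate (toy example).

    Returns the best (value, x) pair among all evaluated samples.
    """
    if len(samples) < 3:
        raise ValueError("need at least three starting samples")
    evaluated = [(objective(x), x) for x in samples]
    for _ in range(n_iter):
        evaluated.sort()
        (f1, x1), (f2, x2), (f3, x3) = evaluated[:3]
        den = (x2 - x1) * (f2 - f3) - (x2 - x3) * (f2 - f1)
        if abs(den) < 1e-12:
            break  # collinear points: the parabola has no stationary point
        num = (x2 - x1) ** 2 * (f2 - f3) - (x2 - x3) ** 2 * (f2 - f1)
        # Stationary point of the interpolating parabola, clipped to the bounds.
        candidate = min(max(x2 - 0.5 * num / den, lower), upper)
        if any(abs(candidate - x) < tol for _, x in evaluated):
            break  # the surrogate proposes a point that is already sampled
        evaluated.append((objective(candidate), candidate))
    return min(evaluated)


best_value, best_x = surrogate_minimize(
    lambda x: (x - 2.0) ** 2 + 1.0, [0.0, 1.0, 4.0], lower=0.0, upper=5.0
)
```

Each proposed solution is evaluated with the true objective and added to the sample set, so the surrogate is refit on a growing set of known points, mirroring the iteration described above.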

Availability
A webserver for conSSert secondary structure predictions is available at
http://ares.tamu.edu/conSSert/.

1. Boukouvala, F.; Floudas, C. A. ARGONAUT: AlgoRithms for Global Optimization of
coNstrAined grey-box compUTational problems. Optimization Letters 2016, Online.
2. Kieslich, C. A; Smadbeck, J.; Khoury, G. A; Floudas, C. A. conSSert: Consensus SVM
Model for Accurate Prediction of Ordered Secondary Structure. Journal of Chemical
Information and Modeling 2016, 56 (3), 455-461.
3. Söding, J. Protein homology detection by HMM-HMM comparison. Bioinformatics 2005,
21: 951-960.
4. Guntert, P. Automated NMR structure calculation with CYANA. Methods in Molecular
Biology 2004, 278, 353-378.
5. Leaver-Fay, A., Tyka, M., Lewis, S. M., Lange, O. F., Thompson, J., Jacak, R., Kaufman, K.,
Renfrew, P. D., Smith, C. A., Sheffler, W., Davis, I. W., Cooper, S., Treuille, A., Mandell, D.
J., Richter, F., Ban, Y. E., Fleishman, S. J., Corn, J. E., Kim, D. E., Lyskov, S., Berrondo, M.,
Mentzer, S., Popovic, Z., Havranek, J. J., Karanicolas, J., Das, R., Meiler, J., Kortemme, T.,
Gray, J. J., Kuhlman, B., Baker, D. & Bradley, P. ROSETTA3: an object-oriented software
suite for the simulation and design of macromolecules. Methods in Enzymology 2011, 487,
545-74.
6. Zhang, Y.; Skolnick, J. Scoring function for automated assessment of protein structure
template quality, Proteins 2004, 57, 702-710.

FLOUDAS_REFINESERVER

Princeton_TIGRESS 2.0: Protein Geometry Refinement Using Simulations and Support
Vector Machines

M. Onel1,2, U. Shah1,2, C.A. Kieslich1,2,*, C.A. Floudas1,2,+


1-Artie McFerrin Department of Chemical Engineering, Texas A&M University, 2-Texas A&M Energy Institute,
Texas A&M University; * Current affiliation: Coulter Department of Biomedical Engineering, Georgia Institute of
Technology; +deceased
kieslich@gatech.edu

Despite numerous computational1-7 and even collaborative8 approaches to the protein refinement
problem reported in the previous three Critical Assessment of Techniques for Protein Structure
Prediction (CASP) experiments (2008-2012), an overwhelming majority of methods degrade models
rather than improve them. This results from many generated models being highly similar to one
another and from the inability of current force fields to distinguish the better from the worse
models. We first developed Princeton_TIGRESS9 for blind predictions during CASP10 and
subsequently improved both the support vector machines model and MD simulations component
for CASP11.

Methods
The protocol begins with the derivation of constraints from the input structure which include
backbone Cα-Cα, disulfide bridge, α-helix hydrogen bond, and β-sheet hydrogen bond
constraints. Next, a decision was made based on the sequence length. For structures with
sequence lengths less than or equal to 154 amino acids, the procedure proceeded with CYANA10
sampling, Rosetta FastRelax11 relaxation, SVM filtering, scoring with the dDFIRE12 energy
function, and a constrained molecular dynamics simulation in CHARMM. The CYANA
sampling stage aims to standardize the input structures coming from a multitude of servers, each
of which uses its own sampling methods, scoring functions, and rotamer libraries. An ensemble of
2000 structures was generated via Rosetta FastRelax. This ensemble was then evaluated for the
changes in features with respect to the input template structure. The structurally relevant features
include SASA, the number of hydrogen bonds, decompositions of multiple physics-based and
statistical energy functions, and geometric criteria that partition distances between residue pairs
into fine distance bins. The SVM model was trained on all CASP 8, 9, and 10 refinement targets
and classifies structures as “refined” or “not refined.” The trained model aimed to
have a low false positive rate. The all-atom MD stage in CHARMM13 was enhanced to
sample longer for larger structures. Additionally, the weights of the constraints were refined
depending on sequence length and starting GDT_TS. If the structure was larger than the cutoff of
154 residues, the procedure proceeded directly to the CHARMM MD stage. The procedure
results in a single refined structure with improved model accuracy and quality as benchmarked
previously9. The protocol was followed consistently with no manual intervention.
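
The length-based branching of the protocol can be summarized in a few lines. This sketch only encodes the order of stages described above; the stage labels and the function name are hypothetical and no external program is called.

```python
LENGTH_CUTOFF = 154  # residues, as stated in the text


def refinement_stages(sequence_length):
    """Return the ordered stage labels for a structure of the given length."""
    stages = ["derive_constraints"]
    if sequence_length <= LENGTH_CUTOFF:
        stages += [
            "cyana_sampling",
            "rosetta_fastrelax",
            "svm_filtering",
            "ddfire_scoring",
        ]
    # Larger structures skip sampling and filtering and go directly to MD.
    stages.append("charmm_constrained_md")
    return stages
```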

Availability
Structure modelers are welcome to submit to the Princeton_TIGRESS 2.0 webserver at
http://atlas.engr.tamu.edu/refinement/.

1. Bhattacharya, D. & Cheng, J. (2013). 3Drefine: Consistent protein structure refinement by
optimizing hydrogen bonding network and atomic‐level energy minimization. Proteins:
Structure, Function, and Bioinformatics 81, 119-131.
2. Rodrigues, J. P., Levitt, M. & Chopra, G. (2012). KoBaMIN: a knowledge-based
minimization web server for protein structure refinement. Nucleic Acids Research 40, W323-
W328.
3. Raval, A., Piana, S., Eastwood, M. P., Dror, R. O. & Shaw, D. E. (2012). Refinement of
protein structure homology models via long, all-atom molecular dynamics simulations.
Proteins: Structure, Function, and Bioinformatics 80, 2071-2079.
4. Zhang, J., Liang, Y. & Zhang, Y. (2011). Atomic-level protein structure refinement using
fragment-guided molecular dynamics conformation sampling. Structure 19, 1784-95.
5. Nugent, T., Cozzetto, D. & Jones, D. T. (2014). Evaluation of predictions in the CASP10
model refinement category. Proteins: Structure, Function, and Bioinformatics 82, 98-111.
6. Mirjalili, V., Noyes, K. & Feig, M. (2014). Physics based protein structure refinement
through multiple molecular dynamics trajectories and structure averaging. Proteins:
Structure, Function, and Bioinformatics 82, 196-207.
7. Heo, L., Park, H. & Seok, C. (2013). GalaxyRefine: protein structure refinement driven by
side-chain repacking. Nucleic Acids Research, W384–W388.
8. Khoury, G. A., Liwo, A., Khatib, F., Zhou, H., Chopra, G., Bacardit, J., Bortot, L., Delbem,
A. C. B., Deng, X., Faccioli, R., He, Y., Krupa, P., Li, J., Mozolewska, M., Baker, D., Cheng,
J., Floudas, C. A., Keasar, C., Levitt, M., Popović, Z., Scheraga, H. A., Skolnick, J., Crivelli,
S. N. & Foldit Players (2014). WeFold: A Coopetition for Protein Structure Prediction. Proteins:
Structure, Function, Bioinformatics 82, 1850-1868.
9. Khoury, G. A., Tamamis, P., Pinnaduwage, N., Smadbeck, J., Kieslich, C. A. & Floudas, C.
A. (2014). Princeton_TIGRESS: Protein geometry refinement using simulations and support
vector machines. Proteins: Structure, Function, and Bioinformatics 82, 794-814.
10. Guntert, P. (2004). Automated NMR structure calculation with CYANA. Methods in
Molecular Biology 278, 353-378.
11. Leaver-Fay, A., Tyka, M., Lewis, S. M., Lange, O. F., Thompson, J., Jacak, R., Kaufman, K.,
Renfrew, P. D., Smith, C. A., Sheffler, W., Davis, I. W., Cooper, S., Treuille, A., Mandell, D.
J., Richter, F., Ban, Y. E., Fleishman, S. J., Corn, J. E., Kim, D. E., Lyskov, S., Berrondo, M.,
Mentzer, S., Popovic, Z., Havranek, J. J., Karanicolas, J., Das, R., Meiler, J., Kortemme, T.,
Gray, J. J., Kuhlman, B., Baker, D. & Bradley, P. (2011). ROSETTA3: an object-oriented
software suite for the simulation and design of macromolecules. Methods in Enzymology
487, 545-74.
12. Yang, Y. & Zhou, Y. (2008). Specific interactions for ab initio folding of protein terminal
regions with secondary structures. Proteins: Structure, Function, and Bioinformatics 72, 793-
803.
13. MacKerell Jr., A. D., Brooks, B., Brooks III, C. L., Nilsson, L., Roux, B., Won, Y. &
Karplus, M. (1998). CHARMM: The Energy Function and Its Parameterization with an
Overview of the Program. In The Encyclopedia of Computational Chemistry (Schleyer, P. v. R.
et al., eds.), Vol. 1, pp. 271-277. John Wiley & Sons: Chichester.

FLOUDAS_SERVER

Fast and Accurate Template-Based Protein Structure Prediction

C.A. Kieslich1,2,*, M. Onel1,2, U. Shah1,2, C.A. Floudas1,2,+


1-Artie McFerrin Department of Chemical Engineering, Texas A&M University, 2-Texas A&M Energy Institute,
Texas A&M University; * Current affiliation: Coulter Department of Biomedical Engineering, Georgia Institute of
Technology; +deceased
kieslich@gatech.edu

FLOUDAS_SERVER is an automated template-based protein structure prediction method that
produces 5 model predictions for each target.

Methods
First, secondary structure predictions were performed using conSSert1. Sequence alignments of
plausible templates were generated using the conSSert predictions and a modified version of
HHsuite2. After obtaining the top template alignments, constraints were derived for backbone
Cα-Cα and α-helix hydrogen bonds in CYANA3 format. Next, many local minima were
generated based on the derived constraints using torsion-angle dynamics in CYANA. The
models generated were scored using the dDFIRE4 energy function. Models with sequence
lengths less than 154 residues were refined using Rosetta FastRelax5. Tertiary contact
predictions were also submitted based on tertiary contacts extracted from the structural
templates identified by conSSert/HHsuite. To extract general tertiary C(beta) contacts,
Delaunay triangulation was applied to identify C(beta) contacts in each identified structural
template, and a consensus score based on the template probability from HHsuite was used
to rank observed contacts. The protocol was followed with no manual intervention.
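
The contact ranking step can be sketched as a probability-weighted vote. The exact form of the consensus score is not given in the abstract, so the weighting below is an assumption: each template contributes its HHsuite probability to every contact it contains, and the sum is normalized by the total probability of all templates. The Delaunay step is not reproduced; contacts are taken as given. Names are hypothetical.

```python
from collections import defaultdict


def rank_template_contacts(template_hits):
    """Rank residue-pair contacts by a probability-weighted consensus score.

    template_hits: list of (probability, contacts), where contacts is an
    iterable of (i, j) residue index pairs observed in that template.
    """
    total = sum(prob for prob, _ in template_hits)
    if total <= 0:
        raise ValueError("template probabilities must sum to a positive value")
    weights = defaultdict(float)
    for prob, contacts in template_hits:
        # Treat (i, j) and (j, i) as the same contact, counted once per template.
        unique_pairs = {(min(i, j), max(i, j)) for i, j in contacts}
        for pair in unique_pairs:
            weights[pair] += prob
    scored = [(pair, weight / total) for pair, weight in weights.items()]
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored


toy_hits = [(0.9, [(1, 5), (2, 8)]), (0.6, [(5, 1)])]
toy_ranking = rank_template_contacts(toy_hits)
```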

Availability: http://atlas.engr.tamu.edu/template/

1. Kieslich, C. A; Smadbeck, J.; Khoury, G. A; Floudas, C. A. conSSert: Consensus SVM
Model for Accurate Prediction of Ordered Secondary Structure. Journal of Chemical
Information and Modeling 2016, 56 (3), 455-461.
2. Söding, J. Protein homology detection by HMM-HMM comparison. Bioinformatics 2005,
21: 951-960.
3. Guntert, P. Automated NMR structure calculation with CYANA. Methods in Molecular
Biology 2004, 278, 353-378.
4. Yang, Y.; Zhou, Y. Specific interactions for ab initio folding of protein terminal regions with
secondary structures. Proteins: Structure, Function, and Bioinformatics 2008, 72, 793- 803.
5. Leaver-Fay, A., Tyka, M., Lewis, S. M., Lange, O. F., Thompson, J., Jacak, R., Kaufman, K.,
Renfrew, P. D., Smith, C. A., Sheffler, W., Davis, I. W., Cooper, S., Treuille, A., Mandell, D.
J., Richter, F., Ban, Y. E., Fleishman, S. J., Corn, J. E., Kim, D. E., Lyskov, S., Berrondo, M.,
Mentzer, S., Popovic, Z., Havranek, J. J., Karanicolas, J., Das, R., Meiler, J., Kortemme, T.,
Gray, J. J., Kuhlman, B., Baker, D. & Bradley, P. ROSETTA3: an object-oriented software
suite for the simulation and design of macromolecules. Methods in Enzymology 2011, 487,
545-74.

FONT

Improved Profile Construction Methods and Their Applications to 3D Structure Prediction
of Proteins

T. Oda1, T. Nakamura1,3, Y. Fukasawa1, K. Tomii1,2,3


1-Artificial Intelligence Research Center, National Institute of Advanced Industrial Science and Technology (AIST),
2-Biotechnology Research Institute for Drug Discovery, AIST, 3-Department of Computational Biology and Medical
Sciences, Graduate School of Frontier Sciences, The University of Tokyo
toshiyuki.oda@aist.go.jp, k-tomii@aist.go.jp

Profile-profile comparison is an effective method for template-based modeling because of its
power in similarity detection and its alignment accuracy. We performed template-based modeling
for CASP12 regular targets using our updated and enhanced profile-profile comparison method
with new profile construction pipelines and using our new structure assessment technique.

Methods
We have updated our profile-profile comparison method named FORTE1. We added two new
features: 1) Score corrections were introduced for positions which originally show an atypical
distribution of scores. 2) It can produce local alignments in addition to glocal alignments which
are produced by the previous version. According to our benchmark, the detection performance of
the updated version was generally higher than that of the previous one. However, the updated
version could miss some true positives detected by the previous version. Thus, we used both
versions in this CASP experiment.
To construct profiles of both targets and templates, we used PSI-BLAST2, DELTA-
BLAST3 and HHblits4. Input multiple sequence alignments (MSAs) for PSI-BLAST were
prepared using two methods. 1) The MSAs were aligned by using MAFFT5 with homologous
sequences detected by SSearch with MIQS6 against NCBI nr, and 2) the MSAs were obtained by
stacking pairwise alignments of structurally similar proteins/domains produced by TM-align7.
We revised the source code of PSI-BLAST to obtain better PSSMs, as the original PSI-BLAST
could produce irregular scores in the PSSM for a gap-rich MSA. With these profiles, both the
updated and previous versions of FORTE were run to detect template candidates.
We constructed 3D models of the target proteins using MODELLER8 with templates
based on the top 10 (up to 100 for hard targets) alignments produced by FORTE. All predicted
models were scored using three assessment methods to evaluate the quality of protein models:
Verify3D, dDFire and our novel assessment technique based on the local structure classification.
In the model selection step, we first removed the constructed models with low Verify3D
quality scores. Then we selected 3D models according to the following criteria: 1) prioritize
templates with high Z-scores (>= 8.0), 2) rank templates based on the results of the
assessment methods, 3) prioritize templates with the same or similar function as the target protein.
The order of the submitted models was decided with human intervention.
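
The automated part of the selection can be sketched roughly as below. This is an illustration under several assumptions: the Verify3D threshold is a caller-supplied value not given in the abstract, the three assessment results are assumed to be combined into one number per model (higher is better), and the functional-similarity criterion and the final human ordering are not modeled. All field and function names are hypothetical.

```python
def select_models(models, verify3d_min, z_cutoff=8.0):
    """Filter by Verify3D score, then order by Z-score priority and assessment score.

    models: list of dicts with keys 'name', 'verify3d', 'z_score', 'assessment'.
    """
    kept = [m for m in models if m["verify3d"] >= verify3d_min]
    # Models from templates at or above the Z-score cutoff come first;
    # within each group, a higher assessment score ranks earlier.
    return sorted(kept, key=lambda m: (m["z_score"] < z_cutoff, -m["assessment"]))


toy_models = [
    {"name": "A", "verify3d": 0.5, "z_score": 9.0, "assessment": 0.4},
    {"name": "B", "verify3d": 0.5, "z_score": 5.0, "assessment": 0.9},
    {"name": "C", "verify3d": 0.1, "z_score": 12.0, "assessment": 0.9},
    {"name": "D", "verify3d": 0.6, "z_score": 8.0, "assessment": 0.7},
]
toy_selection = [m["name"] for m in select_models(toy_models, verify3d_min=0.3)]
```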

Results
We assessed our models for T0868, T0869, T0872, T0900, T0904 and T0944, as their
experimental structures had been released by September 8th. They would be categorized as
template-based modeling targets (the TM-scores of the best server models in stage (2)
were higher than 0.4). Although we were able to construct models that would rank in the top
10 in terms of GDT-HA among the servers' best models in stage (2) for the targets T0868,
T0869, T0872 and T0904, we submitted our best model only for T0868, suggesting that there
is room for improvement in the model selection process.

1. Tomii, K., Hirokawa, T. & Motono, C. (2005). Protein structure prediction using a variety
of profile libraries and 3D verification. Proteins 61 Suppl 7, 114-21.
2. Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z., Miller, W. & Lipman,
D. J. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search
programs. Nucleic Acids Res 25, 3389-402.
3. Boratyn, G. M., Schaffer, A. A., Agarwala, R., Altschul, S. F., Lipman, D. J. & Madden, T.
L. (2012). Domain enhanced lookup time accelerated BLAST. Biol Direct 7, 12.
4. Remmert, M., Biegert, A., Hauser, A. & Soding, J. (2012). HHblits: lightning-fast iterative
protein sequence searching by HMM-HMM alignment. Nat Methods 9, 173-5.
5. Katoh, K. & Standley, D. M. (2013). MAFFT multiple sequence alignment software
version 7: improvements in performance and usability. Mol Biol Evol 30, 772-80.
6. Yamada, K. & Tomii, K. (2014). Revisiting amino acid substitution matrices for
identifying distantly related proteins. Bioinformatics 30, 317-25.
7. Zhang, Y. & Skolnick, J. (2005). TM-align: a protein structure alignment algorithm based
on the TM-score. Nucleic Acids Res 33, 2302-9.
8. Webb, B. & Sali, A. (2014). Comparative Protein Structure Modeling Using MODELLER.
Curr Protoc Bioinformatics 47, 5.6.1-5.6.32.

GAPF_LNCC

META protein structure prediction with a multiple minima genetic algorithm

G.K. Rocha, F.L. Custódio, K.B. dos Santos, E. Correia, M. Miloski and L.E. Dardenne
Laboratório Nacional de Computação Científica, Petrópolis-RJ, Brasil
gregkappaun@gmail.com

We built a modified version of our server workflow for protein structure prediction
(GAPF_LNCC_SERVER) based on the following idea: use the genetic algorithm to combine and
improve the models generated by the CASP12 participating servers. We employ a coarse-grained
representation where all backbone atoms are explicit, with the side chains modeled as a single
superatom. The scoring function combines some physically realistic potentials with
knowledge-based terms to promote hydrogen bonding and secondary structure organization. Global
optimization is carried out by the multiple minima genetic algorithm (GA) and no further
refinement is performed. Selection of the models is then done by means of structural redundancy
filtering and energy pruning. The GAPF_LNCC META workflow was applied to 60 targets of
the “all groups” category.

Methods
The first step is to (1) download all STAGE2 Server predictions for the target. These models are
ranked according to their secondary structure content and the top 50 are used in the remaining
steps (the initial models). After that, a (2) secondary structure “prediction” file is built based on
the consensus secondary structure of the initial models, calculated by DSSP1. All following steps
use this file as the de facto secondary structure prediction. (3) Residue-residue (RR) contact
prediction is made by METAPSICOV2. (4) Fragment libraries are created with Profrager3
(https://www.lncc.br/sinapad/Profrager/), and fragments are selected using the secondary
structure prediction in addition to the local sequence similarities from a culled database of
24,237 experimental structures.
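
The consensus secondary structure of step (2) can be sketched as a per-residue majority vote. The abstract only says "consensus", so the voting rule and the alphabetical tie-break below are assumptions, as is the requirement that all DSSP-style strings have equal length. The function name is hypothetical.

```python
from collections import Counter


def consensus_secondary_structure(ss_strings):
    """Majority vote per residue over equal-length DSSP-style strings."""
    if not ss_strings:
        raise ValueError("need at least one secondary structure string")
    length = len(ss_strings[0])
    if any(len(s) != length for s in ss_strings):
        raise ValueError("all strings must have the same length")
    consensus = []
    for column in zip(*ss_strings):
        counts = Counter(column)
        best = max(counts.values())
        # Break ties deterministically by taking the alphabetically first state.
        consensus.append(min(state for state, n in counts.items() if n == best))
    return "".join(consensus)


toy_consensus = consensus_secondary_structure(["HHEC", "HHEC", "HCEE"])
```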
The (5) initial populations for all runs of the conformational search consist of the initial
models from step 1. If any region of the model is not present, it is modeled by fragment
insertion. The search is carried out by GAPF4, which employs a genetic algorithm (GA) with
seven genetic operators including Ramachandran-based mutations5 and fragment insertion. The
GA methodology uses a scoring function with proper dihedral and steric repulsion terms,
hydrophobic compaction, hydrogen bonding formation6, cooperative hydrogen bonding7 and RR
contacts. It employs a phenotype-based crowding mechanism for the maintenance of useful
diversity within the populations, which has been shown to result in increased performance and to
grant the algorithm multiple solution capabilities. For each target, we perform 10 independent
runs of the GA, each with populations containing 50 individuals, resulting in 500 structures.
These results undergo a (6) structural redundancy filter and the overall top five structures, ranked
by energy, proceed to the next steps. (7) Side chains of the selected structures are reconstructed
using SCWRL48. Finally, the files are (8) formatted according to CASP guidelines, including
(9) filling the temperature column of the pdb files with the confidence in the prediction (0-1,
where 0 is the worst). All server models used are mentioned in the remarks of each submission.
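The redundancy filter and energy ranking of step 6 can be sketched as a greedy selection: walk through the structures from lowest to highest energy and keep one only if it differs enough from everything already kept. This is a minimal sketch under stated assumptions: the names `rmsd` and `select_models`, the 2.0 Å cutoff, and the use of a plain coordinate RMSD without superposition are hypothetical choices, since the abstract does not specify the similarity measure or threshold.

```python
import math


def rmsd(a, b):
    """Coordinate RMSD between two equally sized lists of (x, y, z) points.

    No superposition is performed; the models are assumed to be pre-aligned.
    """
    if len(a) != len(b) or not a:
        raise ValueError("coordinate lists must be non-empty and equally sized")
    total = sum((p - q) ** 2 for pa, pb in zip(a, b) for p, q in zip(pa, pb))
    return math.sqrt(total / len(a))


def select_models(models, cutoff=2.0, n_keep=5):
    """models: list of (name, energy, coords). Returns the names of kept models."""
    kept = []
    for name, energy, coords in sorted(models, key=lambda m: m[1]):
        # Keep a structure only if it is not redundant with any kept one.
        if all(rmsd(coords, other[2]) > cutoff for other in kept):
            kept.append((name, energy, coords))
        if len(kept) == n_keep:
            break
    return [name for name, _, _ in kept]


toy = [
    ("a", -10.0, [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]),
    ("b", -9.5, [(0.0, 0.0, 0.1), (1.0, 0.0, 0.1)]),   # near-duplicate of "a"
    ("c", -8.0, [(5.0, 0.0, 0.0), (6.0, 0.0, 0.0)]),
]
print(select_models(toy))
```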

Availability
Fragment library generation with Profrager is available at:
https://www.lncc.br/sinapad/Profrager/. Other tools are freely available from their authors. Other
steps in the protocol were carried out with experimental versions of our software that will be
made available by contacting the authors. Acknowledgements: FAPERJ (grant
E-26/010.001229/2015), Santos Dumont supercomputer team, AMT, and Intel Brasil.

1. Kabsch, W., & Sander, C. (1983). Dictionary of protein secondary structure: pattern
recognition of hydrogen‐bonded and geometrical features. Biopolymers, 22(12), 2577-2637.
2. Jones, D. T., Singh, T., Kosciolek, T., & Tetchner, S. (2015). MetaPSICOV: combining
coevolution methods for accurate prediction of contacts and long range hydrogen bonding in
proteins. Bioinformatics, 31(7), 999-1006.
3. Santos, K. B., Trevizani, R., Custodio, F. L., & Dardenne, L. E. (2015, January). Profrager
Web Server: Fragment Libraries Generation for Protein Structure Prediction. In Proceedings
of the International Conference on Bioinformatics & Computational Biology (BIOCOMP) (p.
38). The Steering Committee of The World Congress in Computer Science, Computer
Engineering and Applied Computing (WorldComp).
4. Custódio, F. L., Barbosa, H. J., & Dardenne, L. E. (2014). A multiple minima genetic
algorithm for protein structure prediction. Applied Soft Computing, 15, 88-99.
5. Santos, K. B., Custódio, F. L., Barbosa, H. J., & Dardenne, L. E. (2015, August). Genetic
operators based on backbone constraint angles for protein structure prediction.
In Computational Intelligence in Bioinformatics and Computational Biology (CIBCB), 2015
IEEE Conference on (pp. 1-8). IEEE.
6. Rocha, G. K., Custódio, F. L., Barbosa, H. J. C., & Dardenne, L. E. (2015, August). A
multiobjective approach for protein structure prediction using a steady-state genetic
algorithm with phenotypic crowding. In Computational Intelligence in Bioinformatics and
Computational Biology (CIBCB), 2015 IEEE Conference on (pp. 1-8). IEEE.
7. Levy-Moonshine, A., Amir, E. A. D., & Keasar, C. (2009). Enhancement of beta-sheet
assembly by cooperative hydrogen bonds potential. Bioinformatics, 25(20), 2639-2645.
8. Krivov, G. G., Shapovalov, M. V., & Dunbrack, R. L. (2009). Improved prediction of protein
side‐chain conformations with SCWRL4. Proteins: Structure, Function, and Bioinformatics,
77(4), 778-795.

GAPF_LNCC_SERVER

GAPF_LNCC_SERVER: a fully automated server for template-free protein structure
prediction with a multiple minima genetic algorithm

F.L. Custódio, G.K. Rocha, K.B. dos Santos, E. Correia, P.R.T Werdt, M. Miloski and L.E.
Dardenne
Laboratório Nacional de Computação Científica, Petrópolis-RJ, Brasil
flc@lncc.br

We built a fully automated template-free protein structure prediction (PSP) server
(GAPF_LNCC_SERVER workflow) based
on two main ideas: (i) the use of information from non-homologous structures in the form of
residue-residue contact prediction, secondary structure prediction and fragments, and (ii) a
multiple minima genetic algorithm for conformational search. We employ a coarse-grained
representation where all backbone atoms are explicit, with the side chains modeled as a single
superatom. The scoring function combines some physically realistic potentials with knowledge
based terms to promote hydrogen bonding and secondary structure organization. Global
optimization is carried out by the multiple minima genetic algorithm (GA) and no further
refinement is performed. Selection of the models is then done by means of structural redundancy
filtering and energy pruning. The GAPF_LNCC_SERVER workflow was applied to 88 targets,
of the “all groups” and “server only” CASP categories.

Methods
All accessory programs and tools are run locally on the server. The conformational search step is
run at the Santos Dumont cluster (http://www.sdumont.lncc.br/).
Our workflow starts with (1) secondary structure prediction by PSIPRED1 followed by (2)
domain prediction by INTERPROSCAN2. If step 2 returns domains longer than 200 residues, a
division is forced based on the predicted secondary structure. All steps after the second are
carried out for each domain separately. (3) Residue-residue (RR) contact prediction is performed with
METAPSICOV3 and an inter-domain contact map is created when applicable. (4) Fragment
libraries are created with Profrager4 (https://www.lncc.br/sinapad/Profrager/), and fragments are
selected using the secondary structure prediction in addition to the local sequence similarities
from a culled database of 24,237 experimental structures.
The (5) conformational search is carried out by GAPF5 which employs a genetic
algorithm (GA) with seven genetic operators including Ramachandran-based mutations6 and
fragment insertion. The GA methodology uses a scoring function with proper dihedral and
steric repulsion terms, hydrophobic compaction, hydrogen bonding formation7, cooperative
hydrogen bonding8 and RR contacts. It employs a phenotype-based crowding mechanism for the
maintenance of useful diversity within the populations, which has been shown to result in
increased performance and to grant the algorithm multiple solution capabilities. For each target,
we perform 100 independent runs of the GA, each with populations containing 200 individuals,
resulting in 20,000 structures. These results undergo a (6) structural redundancy filter and the
overall top five structures, ranked by energy, proceed to the next steps. If the target was split
into domains, the (7) final structures are assembled with another run of GAPF where the initial
population is seeded with 50 random combinations of structures from step 6. The resulting
structures undergo another round of filtering identical to step 6. (8) Side chains of the selected
structures are reconstructed using SCWRL49. Finally, the files are (9) formatted according to
CASP guidelines, including (10) filling the temperature column of the pdb files with the
confidence in the prediction (0-1, where 0 is the worst).
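The final formatting step, writing a confidence value into the temperature column, can be illustrated with the fixed-column layout of PDB ATOM records, where the temperature factor occupies columns 61-66. The sketch below is an assumption-laden illustration: the function `set_confidence`, the per-residue dictionary, and the default of 0 for residues without a value are hypothetical and not taken from the server's code.

```python
def set_confidence(pdb_lines, confidence):
    """Write per-residue confidence (0-1) into the temperature factor field.

    pdb_lines: PDB records without trailing newlines.
    confidence: dict mapping (chain_id, residue_number) -> value in [0, 1].
    """
    out = []
    for line in pdb_lines:
        if line.startswith(("ATOM", "HETATM")):
            key = (line[21], int(line[22:26]))   # chain ID, residue number
            value = confidence.get(key, 0.0)     # unknown residues get 0 (assumption)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"confidence out of range for {key}: {value}")
            line = line.ljust(66)
            # Columns 61-66 (1-based) hold the temperature factor.
            line = line[:60] + f"{value:6.2f}" + line[66:]
        out.append(line)
    return out


# One ATOM record assembled field by field so the column widths are explicit.
atom = ("ATOM  " "    1" " " " N  " " " "MET" " " "A" "   1" "    "
        "  11.104" "   6.134" "  -6.504" "  1.00" " 25.00" + " " * 10 + " N")
print(set_confidence([atom], {("A", 1): 0.85})[0])
```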

Availability
Fragment library generation with Profrager is available at:
https://www.lncc.br/sinapad/Profrager/. Other tools are freely available from their authors. Other
steps in the protocol were carried out with experimental versions of our software that will be
made available by contacting the authors. The fully functional web server will be open by the
end of 2016. Acknowledgement: FAPERJ (grant E-26/010.001229/2015), Santos Dumont
supercomputer team, AMT, and Intel Brasil.

1. McGuffin, L. J., Bryson, K., & Jones, D. T. (2000). The PSIPRED protein structure
prediction server. Bioinformatics, 16(4), 404-405.
2. Quevillon, E., Silventoinen, V., Pillai, S., Harte, N., Mulder, N., Apweiler, R., & Lopez, R.
(2005). InterProScan: protein domains identifier. Nucleic acids research, 33(suppl 2), W116-
W120.
3. Jones, D. T., Singh, T., Kosciolek, T., & Tetchner, S. (2015). MetaPSICOV: combining
coevolution methods for accurate prediction of contacts and long range hydrogen bonding in
proteins. Bioinformatics, 31(7), 999-1006.
4. Santos, K. B., Trevizani, R., Custodio, F. L., & Dardenne, L. E. (2015, January). Profrager
Web Server: Fragment Libraries Generation for Protein Structure Prediction. In Proceedings
of the International Conference on Bioinformatics & Computational Biology (BIOCOMP) (p.
38). The Steering Committee of The World Congress in Computer Science, Computer
Engineering and Applied Computing (WorldComp).
5. Custódio, F. L., Barbosa, H. J., & Dardenne, L. E. (2014). A multiple minima genetic
algorithm for protein structure prediction. Applied Soft Computing, 15, 88-99.
6. Santos, K. B., Custódio, F. L., Barbosa, H. J., & Dardenne, L. E. (2015, August). Genetic
operators based on backbone constraint angles for protein structure prediction.
In Computational Intelligence in Bioinformatics and Computational Biology (CIBCB), 2015
IEEE Conference on (pp. 1-8). IEEE.
7. Rocha, G. K., Custódio, F. L., Barbosa, H. J. C., & Dardenne, L. E. (2015, August). A
multiobjective approach for protein structure prediction using a steady-state genetic
algorithm with phenotypic crowding. In Computational Intelligence in Bioinformatics and
Computational Biology (CIBCB), 2015 IEEE Conference on (pp. 1-8). IEEE.
8. Levy-Moonshine, A., Amir, E. A. D., & Keasar, C. (2009). Enhancement of beta-sheet
assembly by cooperative hydrogen bonds potential. Bioinformatics, 25(20), 2639-2645.
9. Krivov, G. G., Shapovalov, M. V., & Dunbrack, R. L. (2009). Improved prediction of protein
side‐chain conformations with SCWRL4. Proteins: Structure, Function, and Bioinformatics,
77(4), 778-795.

GOAL, LEE

Protein structure modeling by global optimization

Keehyoung Joo1,2, Seung Hwan Hong1,3, In Suk Joung1,3, Balachandran Manavalan1,3, Jose
Christian Flores Canales1,3, Qianyi Cheng1,3, Seungryong Heo1, Jong Yun Kim1, Sun Young Lee1,
Mikyung Nam1, In-Ho Lee1,4, Sung Jong Lee1,5, and Jooyoung Lee1,2,3*
1-Center for In-Silico Protein Science, Korea Institute for Advanced Study, 130-722, Korea, 2-Center for Advanced
Computation, Korea Institute for Advanced Study, 130-722, Korea, 3-School of Computational Sciences, Korea
Institute for Advanced Study, 130-722, Korea, 4-Korea Research Institute of Standards and Science (KRISS), 305-
340, Korea, 5-Department of Physics, University of Suwon, Hwaseong-Si, 445-743, Korea
*jlee@kias.re.kr

For the CASP12 experiment, we developed a new protocol based on the workflow of our
previous CASP11 protocols (nns/LEE, and LEER)[1, 2]. The protocol follows the usual
template-based modeling (TBM) procedure including template selection, multiple sequence-
structure alignments (MSA), 3D chain building, side-chain re-modeling, and refinement. For
template selection and model selection, we updated our model quality assessment (QA) method.
In 3D chain building, we updated our energy function to include restraints generated from
predicted residue-residue contacts by MetaPsicov [3], and we improved our fragment selection
method for the dynamic fragment assembly energy term. In addition, we updated our model
refinement method. We have applied the global optimization method of conformational space
annealing (CSA) to three modeling stages including MSA, 3D chain building, and side-chain re-
modeling.

Template selection: Template candidates were collected in the same way as in CASP11, using
CRFpred (an in-house machine learning method developed for CASP11), FOLDFINDER
(a profile-profile alignment method using predicted secondary structure information), and HHpred
[4]. For re-ranking of templates, we used quality assessment methods combining an in-house
method called QA, a new machine learning method called SVMQA (support vector machine for quality
assessment), and structural clustering using a community detection method. We built a fold
database containing more than 36,000 protein chains culled at the 90% sequence identity level using the
CD-HIT sequence clustering algorithm [5].

3D chain building: For the 3D modeling, we added contact energy terms utilizing contact
prediction performed by MetaPSICOV, which uses sequence co-evolution information. We
updated the fragment selection method for the dynamic fragment assembly term using a new scoring
function that includes the consistency between predicted and actual values of secondary
structure, solvent accessibility, and phi/psi torsional angles.

With the above modifications to the protocol, we performed 3D model optimization and
side-chain re-modeling by successively applying CSA in the GOAL server method. We note that
the human prediction method LEE is identical to GOAL except that server prediction models are
used as templates.

Quality Assessment: Support-vector-machine-based protein single-model quality assessment
(SVMQA) predicts global quality scores for a given model, such as TM-score and GDT-TS,
based on a feature vector that contains statistical potential energy terms and consistency
terms between actual structural features (extracted from the three-dimensional coordinates) and
predicted values (from the primary sequence).

Model refinement: All selected models were further refined by running restrained MD
simulations. The protein model was solvated in a TIP3P water box and a series of MD simulations
was performed with the AMBER14SB force field. The trajectory was averaged and the structure was
finally energy minimized with an implicit solvation model.

Assisted structure prediction: For targets with provided small angle X-ray scattering data
(called Ts targets), we applied the global optimization used in the 3D modeling step of the GOAL
protocol with an additional energy term using chi-square fitting between the provided scattering
intensity and the intensity calculated from the structure. We also used template restraints when
templates were available. For targets with provided cross-linking data (called Tx targets),
restraints were incorporated using Lorentzian energy terms.
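The chi-square term for the SAXS-assisted targets can be illustrated with a common form of the fit between experimental and calculated intensities. This is a sketch under assumptions rather than the GOAL energy term itself: the function name `saxs_chi2`, the analytically optimal scale factor, and the normalization by the number of data points are choices made here for illustration, and the weight with which such a term would enter the total energy is not given in the abstract.

```python
def saxs_chi2(i_exp, sigma, i_calc):
    """Reduced chi-square between experimental and calculated SAXS intensities.

    i_exp, sigma: experimental intensities and their errors at each q value.
    i_calc: intensities calculated from the model at the same q values.
    Returns (chi2, scale), where scale is the least-squares optimal factor
    multiplying i_calc.
    """
    if not i_exp or not (len(i_exp) == len(sigma) == len(i_calc)):
        raise ValueError("inputs must be non-empty and of equal length")
    num = sum(e * c / s ** 2 for e, s, c in zip(i_exp, sigma, i_calc))
    den = sum(c * c / s ** 2 for s, c in zip(sigma, i_calc))
    scale = num / den
    chi2 = sum(((e - scale * c) / s) ** 2
               for e, s, c in zip(i_exp, sigma, i_calc)) / len(i_exp)
    return chi2, scale


i_exp = [20.0, 10.0, 4.0]
sigma = [1.0, 1.0, 1.0]
good_model = [10.0, 5.0, 2.0]   # same profile shape, different scale
poor_model = [10.0, 6.0, 1.0]   # different profile shape
print(saxs_chi2(i_exp, sigma, good_model))
print(saxs_chi2(i_exp, sigma, poor_model))
```

Because the scale factor is fitted, only the shape of the calculated profile is penalized, which is why the first toy model fits perfectly despite being a factor of two too low.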

Acknowledgements: This work was supported by the National Research Foundation of
Korea (NRF) grant funded by the Korea government (MEST) (No. 2008-0061987). We thank
the Korea Institute for Advanced Study for providing computing resources (KIAS Center for
Advanced Computation Linux Cluster) for this work. The authors would like to acknowledge the
support from KISTI supercomputing center through the strategic support program for the
supercomputing application research. Finally, we thank the CoCoLink company for providing
Klimax-210 with 10 Nvidia TitanX GPUs.

1. K. Joo, I. Joung, S. Y. Lee, J. Y. Kim, Q. Cheng, B. Manavalan, J. Y. Joung, S. Heo, J. Lee,
M. Nam, I.-H. Lee, S. J. Lee, and Jooyoung Lee, Template based protein structure modeling
by global optimization in CASP11, Proteins (Early view).
2. I. Joung, S. Y. Lee, Q. Cheng, J. Y. Kim, K. Joo, S. J. Lee, and J. Lee, Template-free
modeling by LEE and LEER in CASP11, Proteins (Early view).
3. D. T. Jones, T. Singh, T. Kosciolek, and S. Tetchner, (2015). MetaPSICOV: combining
coevolution methods for accurate prediction of contacts and long range hydrogen bonding in
proteins. Bioinformatics, 31(7), 999-1006.
4. J. Söding, (2005) Protein homology detection by HMM-HMM comparison. Bioinformatics
21, 951-960.
5. W. Li, and A. Godzik, (2006) CD-HIT: a fast program for clustering and comparing large sets
of protein or nucleotide sequences, Bioinformatics, 22, 1658-1659.

Grudinin

Using a novel knowledge-based distance-dependent potential derived using convex
optimization for fold recognition of CASP12 targets

S. Grudinin 1,2,3, G. Pages 1,2,3, and G. Derevyanko 4,5
1 - Univ. Grenoble Alpes, LJK, 38000 France, 2 - CNRS, LJK, 38000 France, 3 - Inria, France, 4 -
Forschungszentrum Juelich, Juelich, Germany, 5 - Institute of Structural Biology J.P. Ebel, Grenoble, France
sergei.grudinin@inria.fr

Recently we introduced protein-protein1,2 and protein-ligand3,4 knowledge-based potentials using
polynomial expansions of potential coefficients and convex optimization to deduce the unknown
variables from the knowledge base. The distinctive feature of our potentials is that we use thousands
of unknowns and a regularization term in the optimization problem to reduce possible
overfitting. Here, we assessed the performance of a similar potential, called Convex-PF, derived
for protein fold recognition. More precisely, we trained the Convex-PF atom-level
distance-dependent potential and applied it to the structure prediction and refinement stages of the
CASP12 community experiment.

Methods
First, we generated the training data set consisting of 1,250 protein folds and their misfolded
structures (decoys) using the 3DRobot decoy generation method5. Then, we extracted structural
features (pair distribution functions), and formulated an optimization problem demanding that
the native structure has the lowest energy compared to all of its decoys and that the energy
separation between the native structures and their decoys is maximized. These two conditions
can be formally regarded as misclassification and regularization terms, respectively. Finally, we solved the
problem using the block sequential minimization solver. We should add that we tested multiple
parameters of the method, such as the number of atom types, the effect of polar hydrogen atoms,
the effect of topological connectivity of the atoms, the effect of explicit solvation interactions,
etc. We should also add that our method compares favorably to other scoring functions, such as
Rosetta6 and RWplus7, on standard benchmarks for fold recognition (Rosetta, I-TASSER, and
Modeller).
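The optimization problem described in the Methods can be illustrated with a linear potential E(x) = w · f(x), where f(x) is a vector of structural features, trained so that each native structure scores below its decoys. The sketch below is a toy stand-in, not the Convex-PF procedure: the hinge loss with unit margin, the L2 penalty, the plain subgradient descent (used here instead of the block sequential minimization solver named in the text), and all function names and numbers are assumptions.

```python
def energy(w, features):
    """Linear scoring function: dot product of weights and features."""
    return sum(wi * fi for wi, fi in zip(w, features))


def train_potential(pairs, n_features, lam=0.01, lr=0.1, epochs=200):
    """Fit weights so that natives score lower than their decoys.

    pairs: list of (native_features, decoy_features).
    Minimizes lam/2 * |w|^2 + mean(max(0, 1 - w.(decoy - native))),
    a convex objective, by subgradient descent.
    """
    w = [0.0] * n_features
    for _ in range(epochs):
        grad = [lam * wi for wi in w]                  # regularization term
        for native, decoy in pairs:
            diff = [d - n for d, n in zip(decoy, native)]
            if energy(w, diff) < 1.0:                  # margin violated
                for k, dk in enumerate(diff):
                    grad[k] -= dk / len(pairs)         # misclassification term
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w


# Toy two-bin "pair distribution" features for one native and two decoys.
native = [3.0, 1.0]
decoys = [[1.0, 3.0], [2.0, 2.5]]
weights = train_potential([(native, d) for d in decoys], n_features=2)
print(weights, energy(weights, native), [energy(weights, d) for d in decoys])
```

The convexity of this kind of objective is what makes it practical to fit thousands of unknowns at once; the regularization weight `lam` controls the trade-off between separating the training decoys and keeping the coefficients small.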

Results
We used the trained Convex-PF potential at the structure prediction and refinement stages of
CASP12. For the structure prediction stage, we first picked three stage-2 server predictions with
the best score, then sampled the picked structures using the 3DRobot sampling engine5, and
finally re-scored all the obtained solutions (typically a few thousand) using the Convex-PF
potential. For the refinement stage, we started with 3DRobot sampling and proceeded with the
re-scoring of the predictions using the Convex-PF potential. We did not perform any manual
intervention and applied the same protocol for all the predictions.

Availability
The presented scoring potential will be made available at https://team.inria.fr/nano-d/software/.

1. P. Popov and S. Grudinin (2015). Knowledge of Native Protein–Protein Interfaces Is
Sufficient To Construct Predictive Models for the Selection of Binding Candidates, Journal
of Chemical Information and Modeling, 55, 2242-2255.
2. E. Neveu, D. W. Ritchie, P. Popov, and S. Grudinin (2016). PEPSI-Dock: a detailed
data-driven protein–protein interaction potential accelerated by polar Fourier correlation,
Bioinformatics, 32, i693-i701.
3. S. Grudinin, P. Popov, E. Neveu, G. Cheremovskiy (2016). Predicting Binding Poses and
Affinities in the CSAR 2013–2014 Docking Exercises Using the Knowledge-Based
Convex-PL Potential, Journal of Chemical Information and Modeling, 56, 1053–1062.
4. M. Kadukova, and S. Grudinin. Convex-PL - a novel knowledge-based potential for
protein-ligand interactions deduced from structural databases using convex optimization,
Unpublished.
5. H. Deng, Y. Jia, Y. Zhang. (2015). 3DRobot: automated generation of diverse and well-
packed protein structure decoys, Bioinformatics, 32, 378-387.
6. O’Meara et al. (2015). A Combined Covalent-Electrostatic Model of Hydrogen Bonding
Improves Structure Prediction with Rosetta, J. Chem. Theory Comput, 11, 609–622.
7. J. Zhang and Y. Zhang (2010). A novel side-chain orientation dependent potential derived
from random-walk reference state for protein fold selection and structure prediction. PLoS
One, 27, e15386.

Grudinin (SAXS-assisted)

SAXS-guided protein structure relaxation along the lowest normal modes with application
to CASP12 SAXS-assisted targets

S. Grudinin 1,2,3 and A. Hoffmann 1,2,3
1 - Univ. Grenoble Alpes, LJK, 38000 France, 2 - CNRS, LJK, 38000 France, 3 - Inria, France
sergei.grudinin@inria.fr

Small-angle scattering is one of the fundamental techniques for structural studies of biological
systems. Small-angle X-ray scattering (SAXS) is a type of small-angle scattering where X-rays
scatter elastically from the sample and are then collected at very small angles. Compared to other
structure determination methods, SAXS experiments are conceptually very simple, and thanks to
advances in instrumentation1, the SAXS technique has become very popular in recent years
as a complement to other methods in structural biology2. SAXS also overcomes some
restrictions of other experimental techniques: for example, it is applicable to systems of all sizes, it
allows the study of particles in solution, it is very fast, and it damages the sample only marginally.
On the downside, SAXS can only determine the distance distribution function of the electron
density at a supra-nm resolution. Nevertheless, SAXS is typically the method of choice for
low-resolution structural studies of proteins in solution.

Methods
We present an application of Pepsi-SAXS (Pepsi stands for Polynomial Expansions of Protein
Structures and Interactions)3, an implementation of the multipole-based scheme, to the protein
structure refinement task. Overall, Pepsi-SAXS is significantly faster than other popular
tools such as Crysol4 and FoXS5, as we recently demonstrated using an extensive
number of test cases3. To perform the structure optimization, we combined Pepsi-SAXS with the
non-linear rigid block normal mode approach (NOLB)6. More precisely, we computed the derivative
of the Chi2 difference between the model and experimental scattering profiles with respect to the
amplitudes of several lowest modes, typically 10 to 25. Then, using a line search, we computed
the step size along the chosen direction and repeated the procedure as long as Chi2 kept improving.
Overall, we allowed 100 minimization steps, which was sufficient to significantly change the
root mean square deviation compared to the initial structure.
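The minimization loop described above can be sketched as a descent in the space of mode amplitudes. The example below is a toy illustration, not Pepsi-SAXS or NOLB: `relax_along_modes`, the finite-difference gradient, the step-halving line search, and the quadratic function standing in for Chi2 are all assumptions made for the sketch, and real normal-mode displacements would be non-linear in the amplitudes.

```python
import math


def relax_along_modes(chi2, n_modes, max_steps=100, h=1e-4, step0=1.0):
    """Minimize chi2(amplitudes) by steepest descent with a halving line search.

    chi2: callable taking a list of n_modes amplitudes and returning a float.
    Stops when no step along the descent direction improves chi2, or after
    max_steps iterations. Returns (amplitudes, chi2_value).
    """
    amps = [0.0] * n_modes
    best = chi2(amps)
    for _ in range(max_steps):
        # Forward finite-difference derivative with respect to each amplitude.
        grad = []
        for k in range(n_modes):
            shifted = list(amps)
            shifted[k] += h
            grad.append((chi2(shifted) - best) / h)
        norm = math.sqrt(sum(g * g for g in grad))
        if norm == 0.0:
            break
        direction = [-g / norm for g in grad]
        step, improved = step0, False
        while step > 1e-8:
            trial = [a + step * d for a, d in zip(amps, direction)]
            value = chi2(trial)
            if value < best:
                amps, best, improved = trial, value, True
                break
            step *= 0.5
        if not improved:
            break
    return amps, best


def toy_chi2(a):
    """Stand-in for the SAXS misfit as a function of two mode amplitudes."""
    return 1.0 + (a[0] - 0.8) ** 2 + 4.0 * (a[1] + 0.3) ** 2


amplitudes, final_chi2 = relax_along_modes(toy_chi2, n_modes=2)
print(amplitudes, final_chi2)
```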
If the SAXS data unambiguously showed a multimeric protein state, prior to the SAXS-guided
relaxation we first assembled the multimeric structures using our fast Fourier
transform-accelerated symmetry assembler called SAM7. Then, we picked the top-scored assemblies and
relaxed them with respect to the SAXS profiles along the lowest normal modes as described
above. Finally, we ranked all the predictions according to the Chi2 score. In a few difficult cases,
we additionally discriminated between the predictions using a deep learning (DL)-based
scoring function.

Results
We assessed the performance of our method using all the SAXS-assisted targets in CASP12. If
the same target was available for the refinement task, we used the provided refinement template
as the initial structure for the subsequent relaxation. Otherwise, we exhaustively relaxed all the
150 stage-2 server predictions. For the Ts866 target, we assembled the initial structure using the
D6 symmetry, and for the Ts909 target, we assembled the initial structure using the C3 symmetry.
All other initial structures were considered monomeric. For targets Ts896, Ts899, Ts901, Ts941,
and Ts947 we additionally ranked the relaxed structures using the DL-based scoring function.
Overall, our method could significantly reduce the value of Chi2 for almost all the
targets. For example, in the best Ts941 model Chi2 was reduced from 388 to 2.7. On the other
hand, for some other targets, such as Ts896 or Ts899, many of the stage-2 server predictions already
had low values of Chi2, so an additional scoring stage was required.

Availability
Pepsi-SAXS is available at https://team.inria.fr/nano-d/software/pepsi-saxs/.

1. Spilotros, A. and Svergun, D. I. (2014). Encyclopedia of Analytical Chemistry, 1–34.
2. Graewert, M. A. & Svergun, D. I. (2013). Current opinion in structural biology, 23, 748–754.
3. S. Grudinin, M. Garkavenko, and A. Kazennov. Pepsi-SAXS: an adaptive method for rapid
and accurate computation of small angle X-ray scattering profiles. Acta Cryst D, in
revision.
4. Svergun, D., Barberato, C. & Koch, M. (1995). Journal of applied crystallography, 28,
768–773.
5. Schneidman-Duhovny, D., Hammel, M., Tainer, J. A. & Sali, A. (2013). Biophysical
journal, 105, 962–974.
6. A. Hoffmann, and S. Grudinin. Non-linear rigid block normal modes.
https://team.inria.fr/nano-d/software/nolb-normal-modes/.
7. D. W. Ritchie and S. Grudinin (2016). Spherical polar Fourier assembly of protein
complexes with arbitrary point group symmetry. Journal of Applied Crystallography, 49,
158-167.

Grudinin-DeepL

Using deep learning for fold recognition of CASP12 targets

S. Grudinin 1,2,3 and G. Derevyanko 4,5
1 - Univ. Grenoble Alpes, LJK, 38000 France, 2 - CNRS, LJK, 38000 France, 3 - Inria, France, 4 -
Forschungszentrum Juelich, Juelich, Germany, 5 - Institute of Structural Biology J.P. Ebel, Grenoble, France
sergei.grudinin@inria.fr

Deep learning (DL) is a popular approach in the field of machine learning, which recently
gained momentum and sparked a lot of interest in the research community1, particularly in
computer vision and image recognition. Unlike previous “shallow” approaches, DL tries to learn
a hierarchical representation of the data at hand. It alleviates the need for feature extraction and the
‘curse of dimensionality’ that limited the performance of “shallow” approaches. Recently, DL was
applied to biological data and yielded remarkable results in the human splicing code prediction2,
identification of DNA- and RNA-binding motifs3 and predicting the effects of non-coding DNA
variants at single nucleotide polymorphism precision4.
Here we trained a DL atom-level model and applied it to the quality assessment,
structure prediction and refinement stages of the CASP12 community experiment.

Methods
We present one of the first applications of deep convolutional networks to the problem of protein
fold recognition. More precisely, we generated the data set consisting of 1,250 protein folds and
their misfolded structures (decoys) using the 3DRobot decoy generation method5. The network
architecture has 23 layers consisting of 3D convolution and maximum pooling with the rectified
linear unit (ReLU) nonlinearity, followed by three fully-connected layers. To prevent overfitting,
we used L1 regularization. The network was trained by stochastic gradient descent
with a loss function that aims to maximize the correlation between the score and the
root-mean-square deviation (RMSD) of the decoys. We should specifically mention that we did not
explicitly extract descriptors from the protein structures.
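One simple way to express "maximize the correlation between the score and the RMSD" as a training objective is to minimize the negative Pearson correlation over the decoys of a target. The sketch below shows only that objective in plain Python; the names `pearson` and `correlation_loss` are hypothetical, and the abstract does not state which correlation measure or batching scheme the network actually used.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def correlation_loss(scores, rmsds):
    """Loss that is minimal (-1) when scores increase linearly with RMSD."""
    return -pearson(scores, rmsds)


scores = [0.1, 0.4, 0.9]          # predicted scores for three decoys
print(correlation_loss(scores, [1.0, 4.0, 9.0]))   # scores track RMSD
print(correlation_loss(scores, [9.0, 6.0, 1.0]))   # scores anti-track RMSD
```

With this sign convention a low score indicates a near-native decoy, so the trained score behaves like an energy.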

Results
First, we assessed the prediction quality of our model using three popular benchmarks for fold
recognition (Rosetta, I-TASSER, and Modeller) and demonstrated that our approach produces
results similar to or better than those of other knowledge-based potentials.
Then, we applied our DL model for the quality assessment (QA) task of stage-1 and
stage-2 predictions of CASP12. More precisely, we blindly scored all the submitted models and
then rescaled the obtained results to a [0;1] interval. We did not produce residue-based scores,
since the DL model only evaluates the whole protein structure. We did not perform any manual
intervention, did not optimize the provided structures, and applied the same protocol for all the
QA predictions.
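The rescaling of raw scores to the [0;1] interval can be illustrated with a min-max transformation over all models of a target. This is a minimal sketch under assumptions: the function `rescale_scores`, the `invert` option for scores where lower means better, and the value returned when all scores are identical are hypothetical, since the abstract does not describe the exact rescaling.

```python
def rescale_scores(scores, invert=False):
    """Min-max rescale raw scores to the [0, 1] interval.

    invert=True maps the lowest raw score to 1, for scores where lower is better.
    If all scores are equal, 0.5 is returned for every model (assumption).
    """
    if not scores:
        return []
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.5] * len(scores)
    scaled = [(s - lo) / (hi - lo) for s in scores]
    return [1.0 - v for v in scaled] if invert else scaled


raw = [2.0, 4.0, 6.0]
print(rescale_scores(raw))
print(rescale_scores(raw, invert=True))
```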
Finally, we used the trained DL model at the structure prediction and refinement stages of
CASP12. For the structure prediction stage, we first picked three stage-2 server predictions with
the best DL score, then sampled the picked structures using the 3DRobot sampling engine5, and
finally re-scored all the obtained solutions (typically a few thousand) using the DL score.
For the refinement stage, we started with 3DRobot sampling and proceeded with re-scoring the
predictions using the DL score. This can be considered as a physics-based sampling approach
followed by a knowledge-based scoring approach. We did not perform any manual intervention
and applied the same protocol for all the predictions.

Availability
The scoring method will be made available at https://team.inria.fr/nano-d/software/.

1. Y. LeCun, Y. Bengio, G. Hinton (2015). Deep learning, Nature, 521, 436-444.
2. H. Y. Xiong, ... Q. Morris, The human splicing code reveals new insights into the genetic
determinants of disease, Science, 347, 1254806.
3. B. Alipanahi, A. Delong, M. T. Weirauch, B. J. Frey. (2015). Predicting the sequence
specificities of DNA-and RNA-binding proteins by deep learning, Nature biotechnology,
33, 831–838.
4. J. Zhou, O. G. Troyanskaya. (2015). Predicting effects of noncoding variants with deep
learning-based sequence model, Nature methods, 12, 931-934.
5. H. Deng, Y. Jia, Y. Zhang. (2015). 3DRobot: automated generation of diverse and well-
packed protein structure decoys, Bioinformatics, 32, 378-387.

InnoUNRES

Use of Modified UNRES force field and replica-exchange molecular dynamics in physics-
based template-free prediction of protein structures

E. Lubecka1,2, A.G. Lipska1, A.K. Sieradzan1*
1 - Faculty of Chemistry, University of Gdańsk, Wita Stwosza 63, 80-308 Gdańsk, Poland, 2 - Institute of
Informatics, University of Gdańsk, Wita Stwosza 57, 80-308 Gdańsk, Poland
adasko@sun1.chem.univ.gda.pl

Previous CASP experiments have shown that physics-based approaches are currently less
efficient than knowledge-based approaches in the prediction of protein structure; however, their
advantage is independence from structural databases. We used a coarse-grained force field, as the use of
all-atom force fields is impractical in de novo simulations of protein structure due to the excessive
time and huge computer resources required. On the other hand, de novo simulations of even large
proteins are feasible with coarse-grained force fields, which use a highly reduced representation of
polypeptide chains, at the cost of a low resolution of the resulting models.
In the last several years, we have been developing the physics-based united-residue
(UNRES) force field for physics-based prediction of protein structures and large-scale
simulations of protein folding, together with a variety of methods for searching the
conformational space1. Recently, we introduced various improvements in UNRES. In the present
CASP experiment, we tested a modified version of the physics-based coarse-grained UNRES
force field.

Methods
In the UNRES model1, a polypeptide chain is represented by a sequence of alpha-carbon atoms
connected by virtual bonds with attached side chains. Two interaction sites are used to represent
each amino acid: the united peptide group (p), located in the middle between two consecutive
alpha-carbon atoms, and the united side chain (SC). The interactions of this simplified model are
described by the UNRES potential derived from the generalized cluster-cumulant expansion of a
restricted free energy (RFE) function of polypeptide chains. The cumulant expansion enabled us
to determine the functional forms of the multibody terms in UNRES. The effective energy
function depends on temperature2. The three most significant differences between the
original UNRES3 force field and the modified force field are as follows: (i) introduction of
periodic-boundary conditions4, which enables us to handle multimeric proteins, (ii) introduction of
restraints on the backbone virtual-bond valence angles from secondary-structure prediction and
(iii) modification of the energy term accounting for peptide-group–peptide-group interactions.
We introduced a shielding function, which modifies the strength of the interactions between
peptide-group dipoles depending on their screening from the solvent by side chains. The strength
of the peptide-group–peptide-group interaction was linearly proportional to the volume of the first
hydration sphere occupied by side chains.
The structures of the target proteins were predicted by the following four-stage
procedure. First, UNRES was employed to carry out Multiplexed Replica Exchange Molecular
Dynamics (MREMD)5 for target proteins. To speed up the search and improve accuracy,
restraints were imposed on secondary structure based on secondary structure prediction by
PSIPRED6. Those restraints were imposed on both torsional and valence angles. The
strength of those restraints was proportional to the PSIPRED score. Second, based on the MREMD
simulation results, the Weighted-Histogram Analysis Method (WHAM) was used to calculate the
relative free energy of each structure from the last section of the MREMD simulation1. Third, cluster
analysis was employed to cluster the structures from an MREMD simulation. Five clusters with
the lowest free energies were chosen as prediction candidates. Finally, in the fourth stage, the
conformations closest to the respective average structures corresponding to the found clusters
were converted to all-atom structures using the PULCHRA7 and SCWRL8 algorithms. These
all-atom structures were submitted to the CASP website.
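
The third and fourth stages (ranking clusters by free energy and picking the conformation closest to each cluster average) can be sketched in Python as follows. This is a minimal illustration, not the UNRES implementation: the function names are hypothetical, the per-structure relative free energies and cluster labels are assumed to be already available from WHAM and the cluster analysis, and the coordinates are assumed to be already superposed.

```python
import numpy as np

K_B = 0.0019872  # Boltzmann constant in kcal/(mol K)

def cluster_free_energies(free_energies, labels, temperature):
    """Combine per-structure relative free energies into one value per cluster."""
    beta = 1.0 / (K_B * temperature)
    result = {}
    for c in np.unique(labels):
        f = free_energies[labels == c]
        f_min = f.min()
        # log-sum-exp, shifted by the minimum for numerical stability
        result[int(c)] = f_min - np.log(np.exp(-beta * (f - f_min)).sum()) / beta
    return result

def pick_representatives(coords, free_energies, labels, temperature, n_models=5):
    """Return indices of the structures closest to the average structure of
    the n_models clusters with the lowest free energies.

    coords has shape (n_structures, n_sites, 3) and is assumed superposed.
    """
    fe = cluster_free_energies(free_energies, labels, temperature)
    best = sorted(fe, key=fe.get)[:n_models]
    representatives = []
    for c in best:
        idx = np.flatnonzero(labels == c)
        mean = coords[idx].mean(axis=0)
        rmsd = np.sqrt(((coords[idx] - mean) ** 2).sum(axis=2).mean(axis=1))
        representatives.append(int(idx[rmsd.argmin()]))
    return representatives
```

The returned indices are ordered from the lowest to the highest cluster free energy, so the first index would correspond to model 1.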

Results
We postpone the assessment of the approach until the official release of CASP12 results.

Availability
The UNRES package is available at www.unres.pl.

1. Liwo,A., Czaplewski,C., Ołdziej,S., Rojas,A.V., Kaźmierkiewicz,R., Makowski,M.,
Murarka,R.K., Scheraga,H.A. (2008) Simulation of protein structure and dynamics with the
coarse-grained UNRES force field. In: Coarse-Graining of Condensed Phase and
Biomolecular Systems., ed. G. Voth, Taylor & Francis, Chapter 8, pp. 107-122.
2. Liwo,A., Khalili,M., Czaplewski,C., Kalinowski,S., Ołdziej,S., Wachucik,K., Scheraga,H.A.
(2007) Modification and optimization of the united-residue (UNRES) potential energy
function for canonical simulations. I. Temperature dependence of the effective energy
function and tests of the optimization method with single training proteins. J. Phys. Chem. B
111, 260-285.
3. Sieradzan, A.K., Krupa,P., Scheraga,H.A., Liwo,A., Czaplewski,C. (2015) Physics-based
potentials for the coupling between backbone- and side-chain-local conformational states in
the united residue (UNRES) force field for protein simulations. J. Chem. Theory Comput.,
11, 817-831.
4. Sieradzan,A.K. (2015) Introduction of periodic boundary conditions into UNRES force field.
J. Comput. Chem. 36, 940-946.
5. Czaplewski,C., Kalinowski,S., Liwo,A., Scheraga,H.A. (2009) Application of multiplexed
replica exchange molecular dynamics to the UNRES force field: Tests with α and α+β
proteins. J Chem. Theory Comput. 5, 627-640.
6. McGuffin,L.J., Bryson,K., Jones,D.T. (2000) The PSIPRED protein structure prediction
server. Bioinformatics, 16, 404-405.
7. Rotkiewicz,P., Skolnick,J. (2008) Fast procedure for reconstruction of full-atom protein
models from reduced representations. J. Comput. Chem., 29, 1460-1465.
8. Wang,Q., Canutescu,A.A., Dunbrack,R.L. (2008) SCWRL and MolIDE: Computer Programs
for Side-Chain Conformation Prediction and Homology Modeling. Nat. Protoc. 3,1832-1847.

HHGG

HH guide to the Galaxy: collaborative protein structure prediction by HHpred and
GalaxyRefine

Lim Heo1,†, Martin Steinegger1,2, †, Johannes Söding2,*, and Chaok Seok1,*


1 Department of Chemistry, Seoul National University, Seoul 08826, Republic of Korea, 2 Quantitative and
Computational Biology group, Max-Planck Institute for Biophysical Chemistry, Am Fassberg 11, 37077 Göttingen,
Germany
† These two authors contributed equally to this work.
* correspondence to soeding@mpibpc.mpg.de and chaok@snu.ac.kr

“HH guide to the Galaxy” (HHGG) combines HHpred1, a fast structure prediction method, and
GalaxyRefine2,3, a physics-based refinement server. HHpred uses pairwise comparison of profile
hidden Markov models (HMMs) to detect and align homologous protein templates to predict
structure. HHpred was among the best template-based modeling servers at CASP9 and
CASP10. GalaxyRefine has demonstrated consistent improvements in model quality, and it was
assessed as the best model refinement server in CASP11. In CASP12, the effectiveness of combining
the two methods for protein structure prediction was tested.

Methods
HHpred (server version HHpredA) processes sequences in six consecutive steps: (1) iterative
protein search against a clustered UniProt database using HHblits4, (2) template search in the
PDB70 database using HHsearch5, (3) template reranking, (4) computation of probabilistic distance
restraints from multiple templates, (5) model building with MODELLER6, and (6) side-chain
optimization using SCWRL47.
The models returned by HHpred were refined by using an updated version of
GalaxyRefine2,3. In particular, the energy function was updated by replacing the dipolar-
DFIRE8 potential with a new statistical potential9 that considers solvation states of interacting
atoms as well as their distances and orientations. Side chains were first optimized based on a
graph theory algorithm similar to SCWRL4, but with GALAXY energy. Both backbone and side
chains were then relaxed by repetitive side chain repacking and molecular dynamics simulations.
The five lowest-energy models out of those generated by 24 independent trajectories were submitted
as models 1–5.

Results
The protocol was benchmarked on CASP10/11 single-domain TBM targets (with the best GDT-
TS > 40); it was examined by refining models submitted by HHpred. When model 1 was
considered, model quality was improved by 0.76, 2.08, and 1.73 in GDT-HA, GDC-SC, and
MolProbity score, respectively. GalaxyRefine could consistently improve the models in 74%,
73%, and 100% of targets in terms of the three quality measures. The model quality
improvements by refinement depended little on protein size, fold, and the initial model quality.
Therefore, the HHGG protocol seems to be generally applicable and useful for high-accuracy
protein structure prediction.

Availability
HHpred is available at its web server (http://toolkit.tuebingen.mpg.de/hhpred) or as a standalone
version of HHsuite (https://github.com/soedinglab/hh-suite). GalaxyRefine is available at its web
server (http://galaxy.seoklab.org/refine) or as a standalone version
(http://seoklab.github.io/GalaxyRefine).

1. Hildebrand,A., Remmert,M., Biegert,A. & Söding,J. (2009) Fast and accurate automatic
structure prediction with HHpred. Proteins 77 (suppl 9), 127-132.
2. Lee,G.R., Heo,L. & Seok,C. (2015) Effective protein model structure refinement by loop
modeling and overall relaxation. Proteins in press.
3. Heo,L., Park,H. & Seok,C. (2013) GalaxyRefine: Protein structure refinement driven by side-
chain repacking. Nucleic Acids Res. 41 (W1), W384-W388.
4. Remmert,M., Biegert,A., Hauser,A. & Söding,J. (2012) HHblits: lightning-fast iterative
protein sequence searching by HMM-HMM alignment. Nature Methods 9, 173-175.
5. Söding,J. (2005) Protein homology detection by HMM-HMM comparison. Bioinformatics 21
(7) 951-960.
6. Sali,A. & Blundell,T.L. (1993) Comparative protein modeling by satisfaction of spatial
restraints. J. Mol. Biol. 234, 779-815.
7. Krivov,G.G., Dunbrack,R.L. & Shapovalov,M.V. (2009) Improved prediction of protein side-
chain conformations with SCWRL4. Proteins 77, 778-795.
8. Yang,Y. & Zhou,Y. (2008) Specific interactions for ab initio folding of protein terminal
regions with secondary structures. Proteins 72, 793-803.
9. Heo,L. & Seok,C. A new statistical potential with consideration of solvation effects for
protein simulations. in preparation.

iFold_1, iFold_2, Deepfold-Contact, Deepfold-Boom, naive

Protein Contact Prediction with Deep Fully Convolutional Neural Network

Yang Liu, Qing Ye, Jian Peng


University of Illinois, Urbana Champaign
jianpeng@illinois.edu

Substantial progress has been made in protein contact map prediction. However, most current
methods for contact map prediction1-6 may not predict reliable contact pairs when the quality of
the input multiple sequence alignment (MSA) is poor. Here we present a new deep learning-
based contact prediction algorithm, aiming to capture patterns in long-range evolutionary
couplings and improve the predictive performance.

Methods
Based on the MSA of an input sequence, we first clustered sequences in the MSA according to
their gap patterns in the alignment and segmented the input protein into domains. We then
computed various 1D and 2D features for each domain. The 1D features included the following:
column-wise amino acid composition from the MSA, secondary structure predicted by
PSIPRED7 and solvent accessibility predicted by SOLVPRED5. The 2D features included co-
evolutionary patterns by CCMPred and EVfold. Features of domains were concatenated together
and used as input to a convolutional neural network. The neural network included two 1D
convolutional layers at the bottom with window sizes of 5 and 7, respectively. The output of the
top 1D layer was concatenated with the 2D co-evolutionary features. Nine convolutional layers,
each with 64 5x5-size filters, were stacked to predict contact of each residue pair. In the
experiment, we tried different strategy combinations including four servers and one human
group. Servers Naive and iFold_2 did not split the input sequences into domains and predicted
contact map with two models trained on different datasets, respectively. DeepFold-Contact and
DeepFold-Boom first predicted domain boundaries and then applied HHBlits and JackHMMER,
respectively, for each domain. Our human group selected between HHBlits and JackHMMER
according to the following procedure. For each input protein, we first used HHBlits (with an
e-value threshold of 1e-3) to generate a multiple sequence alignment. If the MSA had fewer than
1000 sequences, we instead used JackHMMER (with an e-value threshold of 10) to generate a
larger alignment.
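
The alignment selection rule used by the human group can be written as a short Python sketch. The two search functions are passed in as hypothetical callables (assumed wrappers around HHBlits and JackHMMER that take a sequence and an e-value threshold and return a list of aligned sequences); only the thresholds and the 1000-sequence cutoff come from the text above.

```python
def choose_msa(sequence, run_hhblits, run_jackhmmer, min_sequences=1000):
    """Prefer the HHBlits alignment; fall back to JackHMMER when it is shallow.

    run_hhblits and run_jackhmmer are hypothetical wrappers with the
    signature (sequence, e_value) -> list of aligned sequences.
    """
    msa = run_hhblits(sequence, 1e-3)
    if len(msa) < min_sequences:
        # Shallow alignment: rerun with the more permissive search
        msa = run_jackhmmer(sequence, 10.0)
    return msa
```

Passing the search tools as arguments keeps the rule itself independent of how the external programs are invoked.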

Results
For model training, we constructed a filtered dataset based on the ASTRAL-2.06 database8. Two
residues are considered as a contact pair if their Cβ-Cβ distance is smaller than a predetermined
threshold. We tried five different thresholds: 7.0Å, 7.5 Å, 8.0 Å, 8.5 Å and 9.0 Å for training.
Models with different thresholds were later averaged for consensus.
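
As a hedged illustration of the contact definition and the consensus step, the following Python sketch labels contacts from Cβ coordinates at each of the five thresholds and averages a list of predicted probability maps. The function names are hypothetical, and plain unweighted averaging is an assumption; the text does not state how the models were weighted.

```python
import numpy as np

THRESHOLDS = (7.0, 7.5, 8.0, 8.5, 9.0)  # Cbeta-Cbeta cutoffs in Angstrom

def contact_map(cb_coords, threshold):
    """Boolean L x L map: True where the Cbeta-Cbeta distance is below threshold."""
    diff = cb_coords[:, None, :] - cb_coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    return dist < threshold

def consensus(prob_maps):
    """Unweighted average of the maps predicted by the per-threshold models."""
    return np.mean(np.stack(prob_maps), axis=0)
```

For training, `contact_map` would be called once per entry in `THRESHOLDS` to produce the label set for the corresponding model.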
The accuracy of our server was evaluated locally on three datasets: CAMEO,
CASP10/11 and an in-house benchmark based on the ASTRAL-2.06 database. To evaluate our
method, only pairs of residues with distances smaller than 8.0 Å were considered as contact pairs.
We compared the mean precision of the top L, L/2, L/5 and L/10 predictions on both medium-range
(sequence separation 12 ≤ |i - j| ≤ 23) and long-range (|i - j| ≥ 24) contact pairs, where L is
the sequence length. Our deep learning-based
method significantly outperforms CCMPred.

                                       Medium range: 12 ≤ |i - j| ≤ 23     Long range: |i - j| ≥ 24
Dataset         #proteins  Method      L        L/2      L/5      L/10     L        L/2      L/5      L/10

Astral Dataset  902        CCMPred     16.24    25.50    41.19    52.11    31.32    43.56    55.92    61.95
                           Our Method  30.09    48.07    69.21    79.26    55.30    69.57    80.21    84.47

CASP10+CASP11   226        CCMPred     19.43    28.23    42.95    51.85    27.79    37.95    47.90    52.66
                           Our Method  33.59    49.08    65.80    73.52    44.34    56.39    66.06    70.03

CAMEO           220        CCMPred     10.80    15.89    25.66    33.41    17.98    25.14    33.51    39.07
                           Our Method  23.85    35.58    50.60    59.34    34.39    44.86    55.11    59.47
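
The precision measure reported in the table can be computed along the following lines. This Python sketch is an illustration under stated assumptions, not the evaluation script used by the authors: the function name and the `divisor` parameter are hypothetical, `pred` is an L x L matrix of predicted contact scores, `native` is the corresponding boolean contact map, and at least one residue pair is assumed to fall in the requested separation range.

```python
import numpy as np

def top_precision(pred, native, min_sep, max_sep=None, divisor=1):
    """Precision of the top L/divisor scored pairs within a separation range."""
    length = pred.shape[0]
    # Upper-triangle pairs (i, j) with j - i >= min_sep
    i, j = np.triu_indices(length, k=min_sep)
    if max_sep is not None:
        keep = (j - i) <= max_sep
        i, j = i[keep], j[keep]
    n_top = max(1, length // divisor)
    # Highest scores first; a stable sort keeps ties in pair order
    order = np.argsort(-pred[i, j], kind="stable")[:n_top]
    return float(native[i[order], j[order]].mean())
```

With these conventions, the long-range top-L/5 precision is `top_precision(pred, native, 24, divisor=5)` and the medium-range top-L precision is `top_precision(pred, native, 12, 23)`.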

1. Morcos,F., Pagnani,A., Lunt,B., Bertolino,A., Marks,D.S., Sander,C., ... & Weigt,M. (2011).
Direct-coupling analysis of residue coevolution captures native contacts across many protein
families. Proc. Natl. Acad. Sci., 108, E1293-E1301.
2. Kamisetty,H., Ovchinnikov,S., & Baker,D. (2013). Assessing the utility of coevolution-based
residue–residue contact predictions in a sequence- and structure-rich era. Proc. Natl. Acad.
Sci., 110, 15674-15679.
3. Seemayer,S., Gruber,M., & Söding,J. (2014). CCMpred—fast and precise prediction of protein
residue–residue contacts from correlated mutations. Bioinformatics, 30, 3128-3130.
4. Wang,S., Li,W., Zhang,R., Liu,S., & Xu,J. (2016). CoinFold: a web server for protein contact
prediction and contact-assisted protein folding. Nuc. Acids Res., gkw307.
5. Jones,D.T., Singh,T., Kosciolek,T., & Tetchner,S. (2015). MetaPSICOV: combining
coevolution methods for accurate prediction of contacts and long range hydrogen bonding in
proteins. Bioinformatics, 31, 999-1006.
6. Jones,D.T., Buchan,D.W., Cozzetto,D., & Pontil,M. (2012). PSICOV: precise structural
contact prediction using sparse inverse covariance estimation on large multiple sequence
alignments. Bioinformatics, 28, 184-190.
7. McGuffin,L.J., Bryson,K., & Jones,D.T. (2000). The PSIPRED protein structure prediction
server. Bioinformatics, 16, 404-405.
8. Fox,N.K., Brenner,S.E., & Chandonia, J. M. (2014). SCOPe: Structural Classification of
Proteins—extended, integrating SCOP and ASTRAL data and classification of new
structures. Nuc. Acids Res., 42, D304-D309.

IntFOLD4

Fully Automated Prediction of Protein Tertiary Structures with Local Model Quality
Scores Using the IntFOLD4 Server

L.J. McGuffin
School of Biological Sciences, University of Reading, Reading, UK
l.j.mcguffin@reading.ac.uk

The IntFOLD server1 integrates our latest methods for: tertiary structure (TS) prediction, domain
boundary prediction, prediction of intrinsically disordered regions, prediction of protein-ligand
interactions and the global and local quality assessment (QA) of predicted 3D models of
proteins. Our focus for the IntFOLD4 TS protocol was the improvement of global model ranking
and local model quality scoring, using the ModFOLD6_rank method, as well as the integration
of several new methods for generating alternative sequence-structure alignments and/or
constructing reference 3D models.

Methods
For CASP12, a bespoke version of the IntFOLD4 server was developed in order to return
appropriately formatted results for just the tertiary structure (TS) prediction category.
Additionally, the local quality assessment predictions (QMODE3) were returned as scores in the
B-factor column of each TS model file. (Predictions in the QMODE1 & QMODE2 QA
categories were also returned by our separate servers; see our ModFOLD6 and ModFOLDclust2
abstracts for details.)
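
To illustrate how per-residue scores can be carried in the B-factor column, the following Python sketch overwrites columns 61-66 of the ATOM and HETATM records of a PDB-format model. It is not the IntFOLD4 code: the function name and the `scores` dictionary (residue number to score) are hypothetical, input lines are assumed to have no trailing newline, and chain identifiers and insertion codes are ignored for brevity.

```python
def write_scores_to_bfactor(pdb_lines, scores, default=0.0):
    """Return PDB lines with the B-factor field replaced by per-residue scores."""
    out = []
    for line in pdb_lines:
        if line.startswith(("ATOM", "HETATM")):
            # Residue sequence number: columns 23-26 (1-based) of the record
            resnum = int(line[22:26])
            value = scores.get(resnum, default)
            line = line.ljust(66)
            # Temperature factor: columns 61-66, written as %6.2f
            line = f"{line[:60]}{value:6.2f}{line[66:]}"
        out.append(line)
    return out
```

Records other than ATOM and HETATM are passed through unchanged, so header and TER lines survive intact.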
Our TS method was developed with the aim of fixing local errors, identified in an initial
pool of single template models, through iterative multi-template modeling. The method attempts
to exploit our previous CASP successes in accurately predicting local errors in our models2 by
taking the global and local per-residue errors into consideration during the multiple template
selection stage3.
For the initial fold recognition stage, 14 different methods were installed and run in-
house to generate up to 10 sequence-to-structure alignments each - resulting in up to 140
alternative single-template based models being generated for each CASP target. The following
fold recognition methods were used: SP34, SPARKS24, HHsearch5, COMA6, SPARKSX7,
CNFsearch8 and the 8 alternative threading methods that are integrated into the current
LOMETS package9 (PPA, dPPA, dPPA2, sPPA, MUSTER, wPPA, wdPPA and wMUSTER).
In the first stage of the IntFOLD4 TS method, all single-template models were assessed
using ModFOLDclust210 in order to assign global and local model quality scores. Using the
single template model quality scores, and other criteria involving template coverage, the
corresponding alignments were then selected from each fold recognition method and used to
build multiple-template models, with the aim of minimizing local errors in the final models. The
alternative multi-template modelling alignment selection methods resulted in the generation of a
new population of up to 124 multi-template models for each target. Additionally, I-TASSER light11
(for sequences <500 residues; run in “light mode” with wall-time restricted to 5h) and
HHpred12 were used to generate 3 models each, which were then added to the final pool of
alternative multi-template models for ranking. In the final stage of the method, the ~130 models
in the final reference set were then evaluated using our ModFOLD6_rank QA method and the
top 5 ranked models were submitted as the final IntFOLD4 TS predictions (see our ModFOLD6
abstract for more details about our ModFOLD6_rank method).

Results
The IntFOLD4 server is continuously benchmarked using the CAMEO server13 (identified as
server58). The IntFOLD4 method has been independently verified to be an improvement on our
previous methods (IntFOLD2 & IntFOLD3). At the time of writing, IntFOLD4 ranks as the 2nd
best 3D server according to the CAMEO lDDT scores based on pairwise comparisons (it is
outperformed by only 1 public server in the benchmark).

Availability
The IntFOLD4 server is available at:
http://www.reading.ac.uk/bioinf/IntFOLD/IntFOLD4_form.html

1. McGuffin,L.J., Atkins,J., Salehe,B.R., Shuid,A.N., Roche,D.B. (2015) IntFOLD: an
integrated server for modelling protein structures and functions from amino acid sequences.
Nucleic Acids Res. 43, W169-73.
2. McGuffin,L.J., Roche,D.B. (2011) Automated tertiary structure prediction with accurate local
model quality assessment using the IntFOLD-TS method. Proteins. 79 S10, 137-46.
3. Buenavista,M.T., Roche,D.B., McGuffin,L. J. (2012) Improvement of 3D protein models
using multiple templates guided by single-template model quality assessment.
Bioinformatics. 28, 1851-1857.
4. Zhou,H., Zhou,Y. (2005) SPARKS2 and SP3 servers in CASP6. Proteins. 61 S7, 152-156.
5. Söding,J. (2005) Protein homology detection by HMM-HMM comparison. Bioinformatics.
21, 951-960.
6. Margelevičius,M., Venclovas,Č. (2010) Detection of distant evolutionary relationships
between protein families using theory of sequence profile-profile comparisons. BMC
Bioinformatics. 11, 89.
7. Yang,Y., Faraggi,E., Zhao,H., Zhou,Y. (2011) Improving protein fold recognition and
template-based modeling by employing probabilistic-based matching between predicted one-
dimensional structural properties of query and corresponding native properties of templates.
Bioinformatics. 27, 2076-2082.
8. Ma,J., Wang,S., Zhao,F., Xu,J. (2013) Protein threading using context-specific alignment
potential. Bioinformatics. 29, i257-65.
9. Wu,S. and Zhang,Y. (2007) LOMETS: A local meta-threading-server for protein structure
prediction. Nucleic Acids Research. 35, 3375-3382.
10. McGuffin,L.J. & Roche,D.B. (2010) Rapid model quality assessment for protein structure
predictions using the comparison of multiple models without structural alignments.
Bioinformatics. 26, 182-188.
11. Roy,A., Kucukural,A., Zhang,Y. (2010) I-TASSER: a unified platform for automated protein
structure and function prediction. Nature Protocols. 5, 725-738.
12. Meier,A., Söding,J. (2015) Automatic Prediction of Protein 3D Structures by Probabilistic
Multi-template Homology Modeling. PLoS Comput Biol. 11, e1004343.
13. Haas,J., Roth,S., Arnold,K., Kiefer,F., Schmidt,T., Bordoli,L. & Schwede,T. (2013) The
Protein Model Portal- a comprehensive resource for protein structure and model information.
Database (Oxford). bat031.

Jones-UCL

Protein structure prediction using FRAGFOLD and MetaPSICOV

D.T. Jones
Bioinformatics Group, Department of Computer Science, University College London, Gower St., London, WC1E
6BT, United Kingdom
d.t.jones@ucl.ac.uk, URL: http://bioinf.cs.ucl.ac.uk

The Jones-UCL group's main prediction efforts were aimed at generating models for the harder
prediction targets using contact-assisted prediction methods; where there were obvious homologous
matches with target domains, a simple meta-prediction method based on our BioSerf method was
used.

Methods
For those CASP12 target domains that we believed could not be reliably predicted using fold
recognition methods (such as pGenTHREADER [1]), we used FRAGFOLD [2] to generate up to
5 structures. This approach to protein tertiary structure prediction is based on the assembly of
recognized supersecondary structural fragments taken from highly resolved protein structures
using a simulated annealing algorithm. The current release of FRAGFOLD, version 4.9, was
very similar to the version used in CASP11. Between 1000 and 3000 structures were generated
for each target domain using UCL's Legion supercomputer, and a simple rigid-body structural
clustering algorithm was used to select the models representing the largest clusters of conformations.
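
The clustering algorithm itself is not specified here. As one plausible reading, the following Python sketch performs a greedy selection of the largest clusters from a precomputed matrix of pairwise rigid-body RMSD values. The function name, the RMSD cutoff and the greedy scheme are all assumptions made for illustration, and the cutoff is assumed to be positive.

```python
import numpy as np

def largest_clusters(rmsd, cutoff, n_models=5):
    """Greedily pick centres of the most populated clusters.

    rmsd is a symmetric (n, n) matrix of pairwise RMSD values with a zero
    diagonal. A structure belongs to a centre's cluster if its RMSD to the
    centre is below cutoff and it has not been claimed by an earlier cluster.
    """
    n = rmsd.shape[0]
    unassigned = np.ones(n, dtype=bool)
    centres = []
    while unassigned.any() and len(centres) < n_models:
        neighbours = (rmsd < cutoff) & unassigned[None, :] & unassigned[:, None]
        counts = neighbours.sum(axis=1)
        centre = int(counts.argmax())  # structure with the most neighbours
        centres.append(centre)
        unassigned[neighbours[centre]] = False  # remove the whole cluster
    return centres
```

The centres are returned in order of decreasing cluster size, which would map naturally onto models 1 to 5.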
One small addition to the protocol for CASP12 was an experimental approach of seeding
some of the FRAGFOLD runs using distance geometry generated models rather than random
conformations. This embedding procedure, loosely based on the approach of Aszódi et al. [3],
was only used for beta-sheet containing targets where some predicted contacts were available.
Contact predictions were generated using the MetaPSICOV method [4].
Submitted predictions were made using little or no human intervention apart from initial
domain assignment and preparation of input secondary structure and sequence alignment files.

Results
Predictions of folds were submitted for all suitable targets (190 models submitted in total).

1. Lobley, A., Sadowski, M.I. and Jones, D.T. (2009) pGenTHREADER and pDomTHREADER:
new methods for improved protein fold recognition and superfamily discrimination,
Bioinformatics, 25, 1761-1767.
2. Jones D.T. (1997) Successful ab initio prediction of the tertiary structure of NK-Lysin using
multiple sequences and recognized supersecondary structural motifs. PROTEINS. Suppl. 1,
185-191.
3. Aszódi, A, Gradwell, MJ, Taylor, WR (1995) Global Fold Determination from a Small
Number of Distance Restraints, J. Mol. Biol. 251, 308-326.
4. Jones, DT, Singh, T, Kosciolek, T, & Tetchner, S. (2015) MetaPSICOV: combining
coevolution methods for accurate prediction of contacts and long range hydrogen bonding in
proteins. Bioinformatics. 31, 999-1006.

KIAS-GDANSK

Protein structure prediction with the UNRES force field aided by knowledge-based
information

A.S. Karczyńska1, M.A. Mozolewska1, P. Krupa1, K. Bojarski1, B. Zaborowski1, A. Liwo1, R.
Ślusarz1, M. Ślusarz1, J. Lee2, K. Joo2,3 and C. Czaplewski1*
1 - Faculty of Chemistry, University of Gdańsk, Wita Stwosza 63, 80-308 Gdańsk, Poland, 2 - Center for In Silico
Protein Structure, Korea Institute for Advanced Study, 87 Hoegiro, Dongdaemun-gu, 130-722 Seoul, Republic of
Korea, 3 - Center for Advanced Computations, Korea Institute for Advanced Study, 87 Hoegiro, Dongdaemun-gu,
130-722 Seoul, Republic of Korea
cezary.czaplewski@ug.edu.pl

Knowledge-based methods are, at present, the most effective tools for the prediction of protein
structures; however, their results heavily depend on the similarity of a target sequence to those of
proteins with known structures. On the other hand, the physics-based methods are still less
accurate and more expensive to execute; however, they are independent of databases and give
reasonable results where the knowledge-based methods fail because of weak sequence similarity.
Therefore, a plausible approach seems to be using knowledge-based methods to determine the
parts of the structures that correspond to sufficient sequence similarity and physics-based
methods to determine the remaining structure. In the current CASP experiment, we tested such a
hybrid approach, using the coarse-grained United Residue (UNRES) force field as the physics-
based component.

Methods
In the UNRES model1, a polypeptide chain is represented by a sequence of alpha-carbon
[C(alpha)] atoms connected by virtual bonds with attached side chains. Only two interaction sites
are used to represent each amino-acid residue: the united peptide group (p), located in the middle
between two consecutive C(alpha) atoms, and the united side chain group (SC). The UNRES
potential-energy function was derived from the generalized cluster-cumulant expansion of a
restricted free energy (RFE) function of polypeptide chains, which enabled us to determine the
functional forms of the multibody terms in UNRES. The effective energy function is temperature
dependent and has been parameterized to reproduce structure and thermodynamics of selected
training proteins2,3.
In the first stage of the prediction procedure, a penalty function corresponding to the
restraints, and pseudopotentials of the Dynamic Fragment Assembly (DFA) approach4 were
determined and added to the UNRES energy function. To obtain the restraints on the
C(alpha)...C(alpha) distances and on the backbone-virtual-bond dihedral angles, we used a
procedure developed in our recent work5, in which a total of 20 top models were selected from
the 4 servers that performed very well in the CASP11 exercise (BAKER-ROSETTA, Zhang-
server, Quark, and GOAL, 5 top models from each server). Subsequently, common fragments
were selected from the models and restraints were imposed only on the corresponding parts of
the structures.

The penalty function corresponding to each restraint was a multimodal function similar to
that of the MODELLER program6; each minimum corresponded to a given model. The DFA
pseudopotentials were determined for each target by using the procedure developed by Sasaki and
Sasai4 and further modified by Lee et al.7.
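
The exact form of the multimodal penalty is not given in this abstract. One common way to write such a restraint, shown here purely as an assumption, is the negative logarithm of a sum of Gaussians centred at the distances observed in the individual server models, so that each model contributes one minimum. The function name and the width `sigma` are hypothetical.

```python
import numpy as np

def multimodal_penalty(d, model_distances, sigma=1.0):
    """Negative log of a sum of unnormalized Gaussians centred at each model distance."""
    d_k = np.asarray(model_distances, dtype=float)
    z = -((d - d_k) ** 2) / (2.0 * sigma ** 2)
    z_max = z.max()
    # log-sum-exp, shifted by the maximum for numerical stability
    return float(-(z_max + np.log(np.exp(z - z_max).sum())))
```

When the model distances are well separated relative to `sigma`, the penalty has one well near each of them and rises in between, which is the behaviour described above.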
For the refinement targets, the fragments corresponding to well-defined secondary
structure were selected from the template provided and restraints were derived from these
fragments. For the assisted targets, additional restraints were added: C(alpha)-C(alpha)
distance restraints for crosslinking-experiment assisted targets, and a maximum-likelihood
term corresponding to the difference between the calculated and the experimental distance
distributions.
In the second stage, coarse-grained Multiplexed Replica Exchange Molecular
Dynamics (MREMD)8 simulations were run in the UNRES force field supplemented with the penalty
function corresponding to the restraints and with the DFA pseudopotentials. In the next step, the
Weighted-Histogram Analysis Method (WHAM) was used to calculate the relative free energy of
each structure from the last slice of the MREMD simulation2. Subsequently, a cluster analysis was
used to obtain five clusters with the lowest free energies. The conformations closest to the
respective average structures corresponding to the found clusters (cluster centroids) were
converted to all-atom structures and refined by using restrained molecular dynamics simulations
with the AMBER12 all-atom force field to obtain the models which were subsequently
submitted.

Results
We postpone the assessment of the approach until the official release of CASP12 results.

Availability
The UNRES package is available at www.unres.pl.

1. Liwo,A., Czaplewski,C., Ołdziej,S., Rojas,A.V., Kaźmierkiewicz,R., Makowski,M.,
Murarka,R.K. & Scheraga,H.A. (2008) Simulation of protein structure and dynamics with
the coarse-grained UNRES force field. In: Coarse-Graining of Condensed Phase and
Biomolecular Systems, ed. G. Voth, Taylor & Francis, Chapter 8, pp. 107-122.
2. Liwo,A., Khalili,M., Czaplewski,C., Kalinowski,S., Ołdziej,S., Wachucik,K. &
Scheraga,H.A. (2007) Modification and optimization of the united-residue (UNRES)
potential energy function for canonical simulations. I. Temperature dependence of the
effective energy function and tests of the optimization method with single training proteins.
J. Phys. Chem. B 111, 260-285.
3. Zaborowski,B., Jagieła,D., Czaplewski,C., Hałabis,A., Lewandowska,A., Żmudzińska,W.,
Ołdziej,S., Karczyńska,A., Omieczynski,C., Wirecki,T. & Liwo,A. (2015) A maximum-
likelihood approach to force-field calibration. J. Chem. Inf. Model. 55, 2050-2070.
4. Sasaki, T. N. & Sasai, M. (2004). A coarse-grained langevin molecular dynamics approach to
protein structure reproduction. Chemical Physics Letters 402, 102-106.
5. Krupa, P., Mozolewska, M. A., Joo, K., Lee, J., Czaplewski, C. & Liwo, A. (2015) Prediction
of protein structure by template-based modeling combined with the UNRES force field. J.
Chem. Inf. Model., 55, 1271–1281.
6. Fiser,A., Sali,A. (2003) “MODELLER: generation and refinement of homology-based
protein structure models.”. In Carter, C.W. and Sweet, R.M., editors, Methods in Enzymology,
374, 463–493, Academic Press, San Diego.
7. Lee,J, Lee,J., Sasaki,T.N., Sasai,M., Seok,C. & Lee,J. (2011) De novo protein structure
prediction by dynamics fragment assembly and conformational space annealing. Proteins:
Struct. Func. Bioinfo., 79, 2403-2417.
8. Czaplewski,C., Kalinowski,S., Liwo,A. & Scheraga,H.A. (2009) Application of multiplexed
replica exchange molecular dynamics to the UNRES force field: Tests with α and α+β
proteins. J Chem. Theory Comput. 5, 627-640.

Kiharalab

Structure Prediction, Refinement, and Quality Assessment with PRESCO, MD, and GMQ

Daisuke Kihara1,2, Hyungrae Kim1, Genki Terashi1, and Woong-Hee Shin1


1-Department of Biological Science, 2-Computer Science, Purdue University, West Lafayette, IN, USA

We submitted models in three categories of tertiary structure prediction to CASP12. They are
regular (TS), refinement (TR) and quality assessment (QA).

Methods

1. Regular targets for structure prediction


We selected server models as the starting point of the modeling procedure. Following the procedure1
our group used in CASP11, models were selected with the PRESCO residue environment
potential2 and a secondary structure fragment packing potential developed recently by our group.
PRESCO compares the residue environments observed in a query model against those in a set of
representative native protein structures to quantify how native-like the surrounding side-chain
positions and main-chain conformations are.
The models were then refined with the CABS coarse-grained protein structure model3. In addition
to the potential implemented in CABS, we added secondary structure fragment-fragment
interactions into the CABS potential and the move set, so that fragment packing is better
modeled.
Side-chain optimization was done by an in-house integrated routine that consists of a context
based side-chain library (Oscar-star) and CHARMM molecular mechanics simulation.

2. TR target Refinement
To refine a model protein structure, hydrogen atoms were added to the template and the resulting
structure was minimized with the CHARMM force field and an implicit solvent model. The minimized
structure was equilibrated using restrained molecular dynamics (MD). Independent MD
trajectories with restrained C-alpha atoms were generated using the CHARMM force field with an
implicit solvent model. Models from the trajectories were evaluated by DFIRE4 and GOAP5 scores
and by GDT-HA values relative to the initial models. For each target, about 2,000 models were
selected by our filtering criteria. The selected models were averaged and then relaxed by a short MD
simulation.

3. QA Quality Assessment
For QA, we used our in-house method, GMQ, which is a single-model QA method (i.e. it does
not consider the similarity of the model to other submitted models). To predict the quality of
a residue in a model, GMQ uses various features of the residue, as well as the predicted quality
of spatially neighboring residues, as input to a machine learning method. GMQ
predicts local quality. To predict global quality, the predicted local quality is averaged
over all residues.

Acknowledgements

We are grateful to ITaP Research Computing at Purdue University for providing us with additional
computational resources for this project. This work was partly supported by the National Institute
of General Medical Sciences of the National Institutes of Health (R01GM097528) and the
National Science Foundation (IIS1319551, DBI1262189, IOS1127027).

1. Kim, H., Kihara, D. (2015) Protein structure prediction using residue- and fragment-
environment potentials in CASP11. Proteins: Structure, Function, and
Bioinformatics, [Epub] doi: 10.1002/prot.24920
2. Kim, H., Kihara, D. (2014) Detecting local residue environment similarity
for recognizing near-native structure models. Proteins: Structure, Function, and
Bioinformatics, 82: 3255-3272.
3. Kolinski, A. (2004). Protein modeling and structure prediction with a reduced
representation. Acta Biochimica Polonica 51, 349-371
4. Zhou, H., Zhou, Y. (2002). Distance-scaled, finite ideal-gas reference state improves structure-
derived potentials of mean force for structure selection and stability prediction. Protein Sci.
11, 2714-2726.
5. Zhou, H., Skolnick, J. (2011). GOAP: a generalized orientation-dependent, all-atom statistical
potential for protein structure prediction. Biophys. J. 101: 2043-2052.

KF-Consensus

Eshel Faraggi 1, Andrzej Kloczkowski 2


1- IUPUI, 2- OSU, NCH

We participated in the Critical Assessment of Protein Structure Prediction (CASP)
experiment with four prediction procedures. The procedure described in this abstract is labeled as
group "KF-Consensus", number 420. This method is based on a new version of the Seder program
[1,2], with new and improved input features that will be described in an upcoming manuscript. For
this procedure, predictions from hard- and soft-trained Seder were combined. We used both mixed
weights and averaged prediction approaches. We used CASP5 through CASP10 server models as
training data and CASP11 server models as a test set. That is, in this case we trained over all CASP
targets and optimized the prediction on both hard and soft CASP11 targets. This version of Seder
was then used to pick among stage 2 clustered CASP targets. To estimate the B-factors for the
protein models, we used the following equation: B-factor = 300 * SPXASA / (1 + model-residue-
depth), where SPXASA is the SPINE-X [3] predicted accessible surface area and model-residue-
depth is the residue depth reported by the program DEPTH [4]. This model came from
approximately fitting a distribution of experimental B-factors.
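The B-factor estimate above is a one-line formula, sketched here for concreteness. The function and argument names are hypothetical, and the units of the two inputs are assumptions (whatever SPINE-X and DEPTH report); the abstract does not state them.

```python
def estimate_b_factor(spx_asa, residue_depth):
    """B-factor = 300 * SPXASA / (1 + model-residue-depth), as stated in the abstract.

    spx_asa:       predicted accessible surface area (scale assumed to be SPINE-X output)
    residue_depth: residue depth (scale assumed to be DEPTH output)
    """
    if residue_depth < 0:
        raise ValueError("residue depth must be non-negative")
    return 300.0 * spx_asa / (1.0 + residue_depth)

# Hypothetical values: an exposed, shallow residue versus a buried, deep one
print(estimate_b_factor(0.8, 1.0))
print(estimate_b_factor(0.1, 9.0))
```

Under this formula, exposed residues near the surface receive higher estimated B-factors than buried ones.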

1. Faraggi, Eshel, and Andrzej Kloczkowski. "A global machine learning based scoring function
for protein structure prediction." Proteins: Structure, Function, and Bioinformatics 82.5
(2014): 752-759.
2. Manuscript in preparation.
3. Faraggi, Eshel, et al. "SPINE X: improving protein secondary structure prediction by multistep
learning coupled with prediction of solvent accessible surface area and backbone torsion
angles." Journal of computational chemistry 33.3 (2012): 259-267.
4. Tan, Kuan Pern, Raghavan Varadarajan, and Mallur S. Madhusudhan. "DEPTH: a web server
to compute depth and predict small-molecule binding cavities in proteins." Nucleic acids
research 39.suppl 2 (2011): W242-W248.

Kloczkowski

Eshel Faraggi 1, Andrzej Kloczkowski 2


1- IUPUI, 2- OSU, NCH

We participated in the Critical Assessment of Protein Structure Prediction (CASP)
experiment with four prediction procedures. The procedure described in this abstract is labeled as
group "Kloczkowski", number 114. This method is based on a new version of the Seder program
[1,2], with new and improved input features that will be described in an upcoming manuscript. For
this procedure, Seder was trained with soft and hard protein targets but tested only with hard
targets. We used CASP5 through CASP10 server models as training data and CASP11 server
models for hard targets as an over-fit set. That is, in this case we trained over all CASP targets and
optimized the prediction only on hard CASP11 targets. Hard targets are defined as those whose
best server model has less than 50% of its residues fitting within 6 Angstrom spheres using the best
computed alignment. This special version of Seder was then used to pick among all CASP12
submitted server models. To estimate the B-factors for the protein models, we used the following
equation: B-factor = 300 * SPXASA / (1 + model-residue-depth), where SPXASA is the SPINE-X
[3] predicted accessible surface area and model-residue-depth is the residue depth reported by
the program DEPTH [4]. This model came from approximately fitting a distribution of
experimental B-factors.
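The hard-target definition above can be illustrated with a short sketch. It assumes that "fitting within 6 Angstrom spheres" means a per-residue model-to-native distance of at most 6 Å after superposition; that reading, the function names and the example distances are all assumptions made for illustration.

```python
def fraction_within_cutoff(distances, cutoff=6.0):
    """Fraction of aligned residues whose model-to-native distance is <= cutoff (in Å)."""
    if not distances:
        raise ValueError("need at least one aligned residue")
    return sum(d <= cutoff for d in distances) / len(distances)

def is_hard_target(best_model_distances, cutoff=6.0, threshold=0.5):
    """A target is 'hard' if its best server model places < 50% of residues within the cutoff."""
    return fraction_within_cutoff(best_model_distances, cutoff) < threshold

# Hypothetical per-residue distances (Å) for the best server model of two targets
print(is_hard_target([1.0, 2.0, 7.0, 8.0, 9.0]))
print(is_hard_target([1.0, 2.0, 3.0, 8.0]))
```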

1. Faraggi, Eshel, and Andrzej Kloczkowski. "A global machine learning based scoring function
for protein structure prediction." Proteins: Structure, Function, and Bioinformatics 82.5
(2014): 752-759.
2. Manuscript in preparation.
3. Faraggi, Eshel, et al. "SPINE X: improving protein secondary structure prediction by multistep
learning coupled with prediction of solvent accessible surface area and backbone torsion
angles." Journal of computational chemistry 33.3 (2012): 259-267.
4. Tan, Kuan Pern, Raghavan Varadarajan, and Mallur S. Madhusudhan. "DEPTH: a web server
to compute depth and predict small-molecule binding cavities in proteins." Nucleic acids
research 39.suppl 2 (2011): W242-W248.

KScons

Inferring Long Range Contacts using the Knob-Socket Model

K. Fraga, H. Joo, S. Patel and J. Tsai


Chemistry Department, University of the Pacific
jtsai@pacific.edu

In an exhaustive analysis of packing in the Protein Data Bank, the knob-socket tetrahedral
construct was identified as a motif to describe long range interactions in protein structure [1-4].
Application of this knob-socket principle to classification of protein structure reveals distinct
amino acid preferences for certain knob-socket arrangements. These preferences define a discrete
amino acid code for the spatial arrangement of protein residues in secondary and tertiary
structure. In particular, long range contacts of a knob residue distant in sequence are modeled
using a Bayesian approach to pack into a socket composed of residues local in sequence [5]. The
sockets were derived from regular lattices of predicted secondary structure.

Methods
As a new description of protein packing structure, the knob-socket model intuitively decomposes
the many-bodied complexity of residue packing into patterns of simple pairwise secondary
structure interactions between a single knob residue and a three-residue socket. Because the
amino acid composition of a socket determines packing with a knob, the knob-socket construct is
used to incorporate structural information into the prediction of residue contacts in the program
KScons. Amino acid preferences in knob-socket constructs are incorporated into a Bayesian
model that considers a multiple structure alignment of homologs if possible and a marginal
posterior distribution.

The KScons program requires a secondary structure prediction of the target sequence. The
development of KScons did not incorporate methods where a secondary structure prediction
confidence could be considered in the contact prediction. Instead KScons assumes that the
secondary structure prediction is perfect. This was done in order to simplify the prediction
procedure. Enumerating all the possible sockets without a clear secondary structure is difficult.
By taking a secondary structure prediction to be true, progress can be made on the contact
prediction. In this CASP, the secondary structure prediction server Jpred was used [6]. Next, a
structure-based multiple sequence alignment is built with MUSCLE [7] from known templates
closely related to the target sequence. KScons then takes the knob-sockets found in
the templates in order to build a Bayesian prior to predict the probability of knob-sockets being
found in the target structure.

Availability
While a package for KScons is being prepared, KScons and data related to the knob-
socket model are currently available upon request from the corresponding author at jtsai@pacific.edu.

1. Joo, H., et al., An amino acid packing code for α-helical structure and protein design. J Mol
Biol, 2012. 419(3-4): p. 234-254.
2. Joo, H. and J. Tsai, An amino acid code for β-sheet packing structure. Proteins, 2014. 82(9):
p. 2128-2140.
3. Joo, H., et al., An amino acid code for irregular and mixed protein packing. Proteins, 2015.
83(12): p. 2147–2161.
4. Fraga, K.J., H. Joo, and J. Tsai, An amino acid code to define a protein's tertiary packing
surface. Proteins, 2016. 84(2): p. 201-16.
5. Li, Q., et al., KScons: A Bayesian Approach for Protein Residue Contact Prediction using the
Knob-Socket Model of Protein Tertiary Structure. Bioinformatics, 2016.
6. Drozdetskiy, A., et al., JPred4: a protein secondary structure prediction server. Nucleic
Acids Res, 2015. 43(W1): p. W389-94.
7. Edgar, R.C., MUSCLE: multiple sequence alignment with high accuracy and high
throughput. Nucleic Acids Res, 2004. 32(5): p. 1792-7.

Laufer, Laufer_seed

Using Physics and Bayesian Inference for Structure Prediction

A. Perez, E. Brini, L.W. Votapka and K.A. Dill


Laufer Center, Stony Brook University
alberto.perezantonanzas@stonybrook.edu

We use a physics based approach to find the native state of proteins starting from server
predictions. We have recently developed Modeling Employing Limited Data (MELD)[1,2] to
accelerate sampling in Molecular Dynamics (MD) trajectories using Coarse Physical Insights
(CPI) (e.g. proteins have hydrophobic cores). This is the second time we have participated in
CASP using this physics based methodology. Access to the high end PRAC resource Blue
Waters allowed our group to tackle 29 of the proteins presented in CASP’s T0 category – an
impressive feat for a method that relies primarily on atomistic molecular dynamics simulations,
and that required several million GPU-hours on the Blue Waters supercomputer.

Methods
MELD combines MD and information in a Bayesian framework. We use OpenMM[3] as the MD
engine and the AMBER force field, ff14SB[4,5]. We add a replica exchange protocol with a
modified Hamiltonian and temperature ladder. Our modifications to the Hamiltonian are intended
to guide the system toward regions of conformational space that contain hydrophobic
contacts, secondary structure elements, paired beta strands, hydrogen bonds, and other structural
features that are likely to be found in a protein. Hamiltonian modifications are imposed in the
form of flat-bottomed harmonic potentials. These potentials bias the protein landscape such that
vast regions of space are made energetically unfavorable, excluding them from the protein’s
conformational search, and leaving protein-like conformations as they were in the original force
field. In the Laufer_seed protocol, we use 30 replicas in our calculations. They are seeded from the
server predictions coming from the following 4 servers: Baker-RosettaServer, myprotein-me,
Quark and Zhang-Server. These structures are minimized with Amber and the ff14SB force field
before being seeded into the MELD replica exchange protocol. All other setup is identical to that
of the Laufer group.
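A flat-bottomed harmonic potential of the kind mentioned above can be sketched as follows. This is a simplified illustration: the functional form used inside MELD may differ (for example in how the walls behave far from the flat region), and the parameter names and values here are hypothetical.

```python
def flat_bottom_harmonic(r, lower, upper, k):
    """Zero penalty for lower <= r <= upper; harmonic penalty outside that window.

    r:     current distance (Å)
    lower: start of the flat region (Å)
    upper: end of the flat region (Å)
    k:     force constant (energy / Å^2), an assumed parameter
    """
    if lower > upper:
        raise ValueError("lower bound must not exceed upper bound")
    if r < lower:
        return 0.5 * k * (lower - r) ** 2
    if r > upper:
        return 0.5 * k * (r - upper) ** 2
    return 0.0

# Hypothetical restraint: no penalty up to 6 Å, harmonic beyond it
for distance in (3.0, 6.0, 8.0, 12.0):
    print(distance, flat_bottom_harmonic(distance, lower=0.0, upper=6.0, k=2.0))
```

Inside the flat region the restraint adds no energy, which is why conformations that already satisfy it are left as they were in the original force field.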
The CPIs that we use in MELD refer to protein-like features that we can use to accelerate
sampling. For CASP12, we have implemented 6 kinds of CPIs: hydrophobic packing,
confinement, secondary structure formation[6], strand pairing, hydrogen bonding, and
evolutionary data[7]. The distinctive feature of the methodology used in MELD for CASP12 is
that not all the information is imposed at the same time. Instead, only a certain percentage of
each type of CPI has to be satisfied – while ensuring that the information guiding the system at
every step remains deterministic. Hence, the whole protocol obeys the property of detailed
balance, which allows us to use the principles of statistical mechanics to analyze our trajectories
to retrieve free energies and construct our predicted structures and folding models.
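One deterministic way to enforce only a percentage of a restraint group is to evaluate every restraint at the current conformation and keep only the lowest-energy ones. The sketch below assumes that selection rule for illustration; the function name, rounding convention and example energies are hypothetical.

```python
def active_restraint_energy(energies, fraction):
    """Energy contributed by one restraint group when only `fraction` of it is enforced.

    Assumed rule: sort the per-restraint energies and sum the lowest ones, so the
    choice of active restraints depends only on the current conformation.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must lie in [0, 1]")
    n_active = round(fraction * len(energies))
    return sum(sorted(energies)[:n_active])

# Hypothetical energies of 10 secondary-structure restraints, 70% enforced
group = [5.0, 0.0, 2.0, 9.0, 1.0, 0.0, 7.0, 3.0, 0.5, 4.0]
print(active_restraint_energy(group, 0.7))
```

With this rule the most violated restraints, which are the ones most likely to be wrong, are simply ignored at that step.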
For each kind of CPI, we obtain a value for the expected accuracy. For secondary
structure predictions, we enforce 70% of the predicted data. The implementation details are
specified in earlier studies[1,2]. Hydrophobic residues that are more than 6 residues apart in
sequence are assigned flat-bottomed harmonic restraints between heavy atoms to bias them to
aggregate together, composing the hydrophobic heuristic. This generates a list of restraints that
scales quadratically with protein size, making the hydrophobic restraints the most expensive
restraints to enforce in most MELD calculations. We assume that every hydrophobic residue
interacts with approximately 1.2 other hydrophobic residues in a typical globular
protein. As a result, we only enforce enough hydrophobic restraints that Nh*1.2 hydrophobic
residues are biased toward contact at any one time, where Nh is the number of hydrophobic
residues. Similarly, for hydrogen bonds, we
consider all possible donors and acceptors that are further than 4 residues apart in sequence and
closer than 10, but only enforce 10% of these pairs that could possibly hydrogen bond. For strand
pairing, the range of the flat-bottomed wells that we impose is much shorter: the flat bottom of
the potential is only 3.5 Å long. We define restraints between all nitrogen and oxygen atoms of
residues predicted to be in the extended conformation[6] and then enforce 45% of these pairs.
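The hydrophobic heuristic can be sketched at the residue level as follows. This is a simplification: the abstract places restraints between heavy atoms, whereas the sketch only lists residue pairs. The set of residues treated as hydrophobic, the function names and the example sequence are assumptions, and truncating Nh*1.2 to an integer is one possible convention.

```python
HYDROPHOBIC = set("AVLIMFW")  # assumed residue set; the abstract does not list it

def hydrophobic_pairs(sequence, min_gap=6):
    """Pairs of hydrophobic residue indices more than `min_gap` positions apart."""
    idx = [i for i, aa in enumerate(sequence) if aa in HYDROPHOBIC]
    return [(i, j) for a, i in enumerate(idx) for j in idx[a + 1:] if j - i > min_gap]

def n_enforced_hydrophobic(sequence, contacts_per_residue=1.2):
    """Number of hydrophobic contacts enforced at any one time (Nh * 1.2, truncated)."""
    n_h = sum(aa in HYDROPHOBIC for aa in sequence)
    return int(n_h * contacts_per_residue)

seq = "LGGGGGGLGGGGGGV"  # hypothetical sequence with three hydrophobic residues
print(hydrophobic_pairs(seq))
print(n_enforced_hydrophobic(seq))
```

The candidate pair list grows quadratically with the number of hydrophobic residues, while the number of restraints enforced at once grows only linearly.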
Each of these CPIs adds a large number of restraints, most of which are incorrect
and do not exist in the native fold of the protein. Fortunately, the Bayesian approach in
conjunction with the force field can identify which data are more compatible with the force field –
folding the protein within a span of time up to 5 orders of magnitude shorter than if the CPIs
were not used[2]. This speed-up is crucial; our simulation setup consists of 30 replicas each of
which runs for as long as possible within the time allotted by the CASP organizers and resource
availability, up to a total of 2.5µs for each replica. We use GPU technology to ensure that the
simulations run as fast as possible. When the simulation completes, we cluster the lowest 5
temperature replicas using an average linkage clustering algorithm[8]. The population of the
cluster is related to the free energy of the members of that cluster, so we choose the centroids of
the top 5 population clusters for submission to CASP. As a final step, we minimize each centroid
structure to the geometrical average of that cluster[9,10] and submit the resulting structures to the
CASP interface.
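The selection of representative structures after clustering can be sketched as follows. The sketch assumes that cluster labels and a pairwise distance (e.g. RMSD) matrix are already available from the clustering step, and it uses the medoid (the member with the smallest summed distance to the other members) as the cluster centroid; both the names and the medoid choice are assumptions.

```python
from collections import Counter

def top_cluster_centroids(labels, dist, n_top=5):
    """Return medoid structure indices of the n_top most populated clusters."""
    ranked = [cluster for cluster, _ in Counter(labels).most_common(n_top)]
    centroids = []
    for cluster in ranked:
        members = [i for i, lab in enumerate(labels) if lab == cluster]
        # Medoid: member with the smallest summed distance to its own cluster
        medoid = min(members, key=lambda i: sum(dist[i][j] for j in members))
        centroids.append(medoid)
    return centroids

# Hypothetical example: six structures placed on a line, three clusters
positions = [0.0, 1.0, 2.0, 10.0, 11.0, 20.0]
labels = [0, 0, 0, 1, 1, 2]
dist = [[abs(a - b) for b in positions] for a in positions]
print(top_cluster_centroids(labels, dist, n_top=2))
```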

Availability
OpenMM, AmberTools, MELD and the MELD-OpenMM plugin are all available and free to use.
Our MELD can be accessed at: git@github.com:maccallumlab/.

1. J.L. MacCallum, A. Perez, K. Dill, Determining protein structures by combining semireliable
data with atomistic physical models by Bayesian inference, Proc. Natl. Acad. Sci. U. S. A.
112 (2015) 6985–6990.
2. A. Perez, J.L. MacCallum, K. Dill, Accelerating molecular simulations of proteins using
Bayesian inference on weak information, Proc. Natl. Acad. Sci. U. S. A. 112 (2015) 11846–
11851.
3. P. Eastman, M.S. Friedrichs, J.D. Chodera, R.J. Radmer, C.M. Bruns, J.P. Ku, et al., OpenMM
4: A Reusable, Extensible, Hardware Independent Library for High Performance Molecular
Simulation, J Chem Theory Comput. 9 (2013) 461–469.
4. J.A. Maier, C. Martinez, K. Kasavajhala, L. Wickstrom, K.E. Hauser, C. Simmerling, ff14SB:
Improving the Accuracy of Protein Side Chain and Backbone Parameters from ff99SB, J
Chem Theory Comput. 11 (2015) 3696–3713.
5. D.A. Case, V. Babin, J. Berryman, R.M. Betz, Q. Cai, Amber 14, San Francisco, 2014.
6. D.T. Jones, Protein secondary structure prediction based on position-specific scoring matrices,
J. Mol. Biol. 292 (1999) 195–202.
7. H. Kamisetty, S. Ovchinnikov, D. Baker, Assessing the utility of coevolution-based residue-
residue contact predictions in a sequence- and structure-rich era, Proc. Natl. Acad. Sci. U. S.
A. 110 (2013) 15674–15679.
8. X. Daura, K. Gademann, B. Jaun, D. Seebach, W.F. van Gunsteren, A.E. Mark, Peptide
folding: When simulation meets experiment, Angewandte Chemie International Edition. 38
(1999) 236–240.
9. A. Perez, A. Roy, K. Kasavajhala, A. Wagaman, K. Dill, J.L. MacCallum, Extracting
representative structures from protein conformational ensembles, Proteins. 82 (2014) 2671–
2680.
10. V. Mirjalili, M. Feig, Protein Structure Refinement through Structure Selection and
Averaging from Molecular Dynamics Ensembles, J Chem Theory Comput. 9 (2013) 1294–
1303.

LEE

Protein structure modeling by global optimization

Keehyoung Joo1,2, Seung Hwan Hong1,3, In Suk Joung1,3, Balachandran Manavalan1,3, Jose
Christian Flores Canales1,3, Qianyi Cheng1,3, Seungryong Heo1, Jong Yun Kim1, Sun Young Lee1,
Mikyung Nam1, In-Ho Lee1,4, Sung Jong Lee1,5, and Jooyoung Lee1,2,3*
1-Center for In-Silico Protein Science, Korea Institute for Advanced Study, 130-722, Korea, 2-Center for Advanced
Computation, Korea Institute for Advanced Study, 130-722, Korea, 3-School of Computational Sciences, Korea
Institute for Advanced Study, 130-722, Korea, 4-Korea Research Institute of Standards and Science (KRISS), 305-
340, Korea, 5-Department of Physics, University of Suwon, Hwaseong-Si, 445-743, Korea
*jlee@kias.re.kr

See GOAL, LEE

LNCCUnB

Template-free protein structure prediction using atomic burials prediction with a multiple
minima genetic algorithm

F.L. Custódio1, D.C. Ferreira2, G.K. Rocha1, E. Correia1, A.F. Pereira de Araújo2 and
L.E. Dardenne1
1 - Laboratório Nacional de Computação Científica, Petrópolis-RJ, Brasil, 2 - Laboratório de Biofísica Teórica e
Computacional, Departamento de Biologia Celular, Universidade de Brasília, Brasília-DF, Brasil
flc@lncc.br

We built a template-free, fragment-based PSP protocol (LNCCUnB workflow) that is based on
two main ideas: the use of atomic burials as informational intermediates between sequence and
structure and the design of a multiple minima genetic algorithm for structure determination. We
employ a coarse-grained representation where all backbone atoms are explicit, with the side
chains modeled as a single superatom. The scoring function combines physically realistic
potentials with knowledge-based terms to promote hydrogen bonding and secondary structure
organization, and with a statistically derived potential that drives each atom of the structure to its
predicted burial level. Global optimization is carried out by the multiple minima genetic
algorithm (GA) and no further refinement is performed. Selection of the models is then done by
means of structural redundancy filtering and pruning by burial energy. The LNCCUnB workflow
was applied to 41 targets of the “all groups” category.

Methods
Our workflow starts with (1) secondary structure prediction by PSIPRED [1], followed by (2)
fragment library creation with Profrager [2] (https://www.lncc.br/sinapad/Profrager/);
fragments are selected using the secondary structure prediction in addition to local sequence
similarities from a culled database of 24,237 experimental structures. We define (3) atomic
burials in terms of the distance of each atom of the protein to the geometric center of the structure.
Protein structures are divided into four concentric layers, and we employ a supervised learning
methodology, based on a Hidden Markov Model [3], to assign each atom of the target protein to
one of these layers. This step employs a training set of 2329 proteins of known structure, derived
from the PDBSELECT [4] list of March 2012, which is intended to maximize structural diversity
while minimizing sequence redundancy. A probability function that describes the distribution
of atomic burials [5] within the same training set is then used to reassign each atom to one of
five (or six, for some targets) concentric layers.
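The burial definition above can be illustrated on a structure of known coordinates. In the workflow the layers are predicted from sequence; the sketch below only shows how burials and layers could be computed from coordinates when building training data. The equal-width binning and the function names are assumptions, not the authors' layer definition.

```python
import math

def burial_distances(coords):
    """Distance of each atom from the geometric center of the structure."""
    n = len(coords)
    center = tuple(sum(c[k] for c in coords) / n for k in range(3))
    return [math.dist(c, center) for c in coords]

def assign_layers(burials, n_layers=4):
    """Assign atoms to concentric layers of equal radial width (assumed binning)."""
    r_max = max(burials)
    if r_max == 0.0:
        return [0] * len(burials)
    width = r_max / n_layers
    return [min(int(b / width), n_layers - 1) for b in burials]

# Hypothetical five-atom structure centered on the origin
atoms = [(0, 0, 0), (2, 0, 0), (-2, 0, 0), (0, 4, 0), (0, -4, 0)]
burials = burial_distances(atoms)
print(burials)
print(assign_layers(burials))
```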
The (4) conformational search is carried out by GAPF [6], which employs a genetic
algorithm (GA) with seven genetic operators, including Ramachandran-based mutations [7] and
fragment insertion. The GA uses a scoring function with proper dihedral and
steric repulsion terms, hydrophobic compaction, hydrogen bond formation [8], cooperative
hydrogen bonding [9] and atomic burials [2]. It employs a phenotype-based crowding mechanism for
the maintenance of useful diversity within the populations, which has been shown to result in
increased performance and to grant the algorithm multiple-solution capabilities. For each target,
we perform 20 independent runs of the GA, each with populations containing 200 individuals,
resulting in 2000 structures. These results undergo a (5) structural redundancy filter, and the
overall top five structures, ranked by burial energy, proceed to the next steps. (6) Side chains
of the selected structures are reconstructed using SCWRL4 [10]. Finally, the files are (7) formatted
according to CASP guidelines, including (8) filling the temperature column of the PDB files with
the atomic burial energy (lower is better).

Availability
Fragment library generation with Profrager is available at:
https://www.lncc.br/sinapad/Profrager/. Other tools are freely available from their authors.
Atomic burials prediction was carried out by HmmPred, available at:
http://www.btc.unb.br/index.php. Other steps in the protocol were carried out with experimental
versions of our software that will be made available by contacting the authors. Acknowledgements:
FAPERJ (grant E-26/010.001229/2015), Santos Dumont supercomputer team, AMT, and Intel
Brasil.

1. McGuffin, L. J., Bryson, K., & Jones, D. T. (2000). The PSIPRED protein structure
prediction server. Bioinformatics, 16(4), 404-405.
2. Santos, K. B., Trevizani, R., Custodio, F. L., & Dardenne, L. E. (2015, January). Profrager
Web Server: Fragment Libraries Generation for Protein Structure Prediction. In Proceedings
of the International Conference on Bioinformatics & Computational Biology (BIOCOMP) (p.
38). The Steering Committee of The World Congress in Computer Science, Computer
Engineering and Applied Computing (WorldComp).
3. Linden, M. G., Ferreira, D. C., Oliveira, L. C., Onuchic, J. N., & Pereira de Araújo, A. F.
(2014). Ab initio protein folding simulations using atomic burials as informational
intermediates between sequence and structure. Proteins: Structure, Function, and
Bioinformatics, 82(7), 1186-1199.
4. Gomes, A. L. C., Rezende, J. R., Pereira de Araújo, A. F., & Shakhnovich, E. I. (2007).
Description of Atomic Burials in Compact Globular Proteins by Fermi-Dirac Probability
Distributions. Proteins 66, 304–320.
5. Hobohm, U., & Sander, C. (1994). Enlarged representative set of protein structures. Protein
Science, 3(3), 522-524.
6. Custódio, F. L., Barbosa, H. J., & Dardenne, L. E. (2014). A multiple minima genetic
algorithm for protein structure prediction. Applied Soft Computing, 15, 88-99.
7. Santos, K. B., Custódio, F. L., Barbosa, H. J., & Dardenne, L. E. (2015, August). Genetic
operators based on backbone constraint angles for protein structure prediction.
In Computational Intelligence in Bioinformatics and Computational Biology (CIBCB), 2015
IEEE Conference on (pp. 1-8). IEEE.
8. Rocha, G. K., Custódio, F. L., Barbosa, H. J. C., & Dardenne, L. E. (2015, August). A
multiobjective approach for protein structure prediction using a steady-state genetic
algorithm with phenotypic crowding. In Computational Intelligence in Bioinformatics and
Computational Biology (CIBCB), 2015 IEEE Conference on (pp. 1-8). IEEE.
9. Levy-Moonshine, A., Amir, E. A. D., & Keasar, C. (2009). Enhancement of beta-sheet
assembly by cooperative hydrogen bonds potential. Bioinformatics, 25(20), 2639-2645.
10. Krivov, G. G., Shapovalov, M. V., & Dunbrack, R. L. (2009). Improved prediction of protein
side‐chain conformations with SCWRL4. Proteins: Structure, Function, and Bioinformatics,
77(4), 778-795.

McGuffin

Manual Prediction of Protein Tertiary and Quaternary Structures, 3D Model Refinement
and Interface Accuracy Assessment

L.J. McGuffin1, A.N. Shuid1, R. Kempster1, J.O. Nealon1, B.R. Salehe1, J.D. Atkins1 and
D.B. Roche2,3
1 - School of Biological Sciences, University of Reading, Reading, UK, 2 - Institut de Biologie Computationnelle,
LIRMM, CNRS-UMR 5506, Université de Montpellier, Montpellier, France, 3 - Centre de Recherche en Biologie
cellulaire de Montpellier, CNRS-UMR 5237, Montpellier, France
l.j.mcguffin@reading.ac.uk

For our manual predictions we made use of several components from our latest servers [1,2] (see
our IntFOLD4 and ModFOLD6 server abstracts for more detail). For our tertiary structure (TS)
predictions we also made use of the CASP hosted 3D server models, ranked using
ModFOLD6_rank and refined with our new refinement method (ReFOLD). For our quaternary
structure predictions, we used a docking and template-based approach (MultiFOLD) along with
a considerable amount of manual intervention, including gaining clues from likely ligand binding
sites (FunFOLD3). For the new category of Interface Accuracy Assessment, we developed a new
structure comparison based method (ModFOLDIA) to evaluate our own quaternary structure
predictions after submission.

Methods

Tertiary structure predictions: The server models were ranked according to the ModFOLD6_rank
global quality scores (see our ModFOLD6 abstract). The top ranked initial model was then
selected and submitted to the ReFOLD and MultiFOLD pipelines described below. For each
model, the ModFOLD6 predicted per-residue error scores were added into the B-factor column
for each set of atom records.

Refinement (ReFOLD): Our new refinement pipeline, ReFOLD, consisted of three protocols. The
first used a rapid iterative strategy (i3Drefine [3]) and the second protocol employed a more
CPU/GPU intensive molecular dynamics simulation strategy (using NAMD [4]) to refine each
starting model. Refined models generated from each protocol were assessed and ranked using
ModFOLD6_rank. The third protocol was a combination of the first 2 approaches, where the top
ranked model from the 2nd protocol was further refined using i3Drefine. Finally, all of the
refined models generated by each of these protocols and the starting model were pooled and
re-ranked using ModFOLD6_rank, and the final top 5 models were selected and submitted.

Quaternary structure predictions (MultiFOLD): The highest scoring models from the ReFOLD
procedure, described above, were used to generate predicted quaternary structures using LZerD [5],
MEGADOCK [6], FRODOCK [7], PatchDock [8] and ZDOCK [9] for dimeric complexes, and
M-ZDOCK [10] and Multi-LZerD [11] for multimeric complexes. In addition to the docking strategy, a
multimeric fold recognition approach was also investigated. The fold template lists (with PDB
and chain IDs) generated by the IntFOLD server [1] were filtered using multimeric data extracted
from PISA [12] for each template. Model assemblies were then constructed using TM-align [13] for
structural superposition of tertiary models onto assemblies and PyMOL was used for
visualisation and manual quality checking of the template generated models. The final predicted
quaternary structures were then ranked for submission using a variety of methods, including each
program's internal ranking procedures, other ranking software, and manual intervention, using
information provided by CASP concerning the origin of the protein and information from our
FunFOLD3 method regarding function and locations of putative bound ligands.

Interface Accuracy Assessment (ModFOLDIA): Late in the prediction season we developed
the ModFOLDIA method to evaluate our own quaternary structures after submission. The
method was used for IMODE1 in CASP12, but can also be used for IMODE2 predictions in
future CASP experiments. The ModFOLDIA method carries out structure-based comparisons of
alternative oligomer models. The first stage of the method was to identify the interface residues
in the model to be scored (defined as <= 5 Å between the heavy atoms in different chains) and
then obtain the minimum contact distance (Dmin) for each contacting residue. The second stage
was to locate the equivalent residues in all other models and then obtain the mean minimum
distances of those residues in all other models (MeanDmin). The final IA score for each of the
interface residues in the model was one minus the absolute difference between Si and the mean Si:
IA = 1 - |Si - MeanSi|, where Si = 1/(1 + (Dmin/20)^2) and MeanSi = 1/(1 + (MeanDmin/20)^2).
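The per-residue IA score can be written out directly from the formula above. The sketch below is illustrative only: the function names and example distances are hypothetical, and the steps that identify interface residues and map equivalent residues between models are assumed to have been done already.

```python
def s_score(d_min, d0=20.0):
    """S = 1 / (1 + (Dmin / 20)^2), mapping a distance in Å onto the interval (0, 1]."""
    return 1.0 / (1.0 + (d_min / d0) ** 2)

def ia_score(d_min, other_d_mins):
    """IA = 1 - |Si - MeanSi| for one interface residue.

    d_min:        minimum contact distance of the residue in the scored model
    other_d_mins: minimum distances of the equivalent residue in all other models
    """
    if not other_d_mins:
        raise ValueError("need at least one other model for comparison")
    mean_d_min = sum(other_d_mins) / len(other_d_mins)
    return 1.0 - abs(s_score(d_min) - s_score(mean_d_min))

# Hypothetical residue: 4 Å contact in the scored model, varying in other models
print(ia_score(4.0, [4.0, 4.5, 30.0]))
```

A residue whose contact distance agrees with the consensus of the other models scores close to 1, and the score drops as the model departs from that consensus.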

Availability
Our software will be freely available after publication from:
http://www.reading.ac.uk/bioinf/downloads/
Methods will soon be integrated with the IntFOLD server:
http://www.reading.ac.uk/bioinf/IntFOLD/

1. McGuffin,L.J., Atkins,J., Salehe,B.R., Shuid,A.N., Roche,D.B. (2015) IntFOLD: an
integrated server for modelling protein structures and functions from amino acid sequences.
Nucleic Acids Res. 43, W169-73.
2. McGuffin,L.J., Buenavista,M.T., Roche,D.B. (2013) The ModFOLD4 Server for the Quality
Assessment of 3D Protein Models. Nucleic Acids Res. 41, W368-72.
3. Bhattacharya,D., Cheng,J. (2013) i3Drefine software for protein 3D structure refinement and
its assessment in CASP10. PLoS One. 8, e69648.
4. Phillips,J.C., Braun,R., Wang,W., Gumbart,J., Tajkhorshid,E., Villa,E., Chipot,C., Skeel,R.D.,
Kalé,L., Schulten,K. (2005) Scalable molecular dynamics with NAMD. J. Comput. Chem. 26,
1781-802.
5. Venkatraman,V., Yang,Y.D., Sael,L., Kihara,D. (2009) Protein-protein docking using region-
based 3D Zernike descriptors. BMC Bioinformatics. 10, 407.
6. Ohue,M., Shimoda,T., Suzuki,S., Matsuzaki,Y., Ishida,T., Akiyama,Y. (2014) MEGADOCK
4.0: an ultra–high-performance protein–protein docking software for heterogeneous
supercomputers. Bioinformatics. 30, 3281–3283.
7. Garzon,J.I., Lopéz-Blanco,J.R., Pons,C., Kovacs,J., Abagyan,R., Fernandez-Recio,J.,
Chacon,P. (2009). FRODOCK: a new approach for fast rotational protein–protein docking.
Bioinformatics. 25, 2544–2551.
8. Duhovny,D., Nussinov,R., Wolfson,H.J. (2002) Efficient unbound docking of rigid
molecules, in: International Workshop on Algorithms in Bioinformatics. Springer, pp. 185–
200.
9. Chen,R., Li,L., Weng,Z. (2003) ZDOCK: An initial-stage protein-docking algorithm.
Proteins. 52, 80–87.
10. Pierce,B., Tong,W., Weng,Z. (2005) M-ZDOCK: a grid-based approach for Cn symmetric
multimer docking. Bioinformatics. 21, 1472–1478.
11. Esquivel-Rodríguez,J., Yang,Y.D., Kihara,D. (2012) Multi-LZerD: Multiple protein docking
for asymmetric complexes. Proteins. 80, 1818-1833.
12. Krissinel,E., Henrick,K. (2007) Inference of Macromolecular Assemblies from Crystalline
State. J. Mol. Biol. 372, 774–797.
13. Zhang,Y., Skolnick,J. (2005) TM-align: a protein structure alignment algorithm based on the
TM-score. Nucleic Acids Res. 33, 2302-9.

MELDingKScons

Inferring Long Range Contacts using the Knob-Socket Model

A. Perez1, K. Fraga2, H. Joo2, S. Patel2 and J. Tsai2


1- Laufer Center, Stony Brook University, 2- Chemistry Department, University of the Pacific
alberto.perezantonanzas@stonybrook.edu

A combined MELD [1, 2] and KScons [3] method was used. Long range contacts were
predicted using KScons and then used by MELD, a molecular dynamics method that builds
atomic structures of proteins given a set of distance constraints, to produce a structure. See the
Laufer and KScons abstracts for more details on MELD and KScons, respectively.

Methods
As a new description of protein packing structure, the knob-socket model intuitively decomposes
the many-bodied complexity of residue packing into patterns of simple pairwise secondary
structure interactions between a single knob residue and a three-residue socket [4-7]. Because
the amino acid composition of a socket determines packing with a knob, the knob-socket
construct is used to incorporate structural information into the prediction of residue contacts in
the program KScons. The secondary structure prediction server Jpred was used [8]. Amino acid
preferences in knob-socket constructs are incorporated into a Bayesian model that considers a
multiple structure alignment of homologs using MUSCLE [9] if possible and a marginal
posterior distribution.

MELD combines MD and information in a Bayesian framework. It uses a replica exchange
protocol with a modified Hamiltonian and temperature ladder. Modifications to the Hamiltonian
are intended to guide the system toward regions of conformational space that contain
hydrophobic contacts, secondary structure elements, paired beta strands, hydrogen bonds, and
other structural features that are likely to be found in a protein. Hamiltonian modifications are
imposed in the form of flat-bottomed harmonic potentials. These potentials bias the protein
landscape such that vast regions of space are made energetically unfavorable, excluding them
from the protein’s conformational search, and leaving protein-like conformations as they were in
the original force field.

Availability
While a package for KScons is being prepared, KScons and data related to the knob-
socket model are currently available upon request from the corresponding author at jtsai@pacific.edu.

OpenMM, AmberTools, MELD and the MELD-OpenMM plugin are all available and free to use.
Our MELD frontend can be accessed at: git@github.com:maccallumlab/meld.git, and the
MELD-OpenMM plugin can be accessed at: git@github.com:maccallumlab/meld-openmm-plugin.git

1. MacCallum, J.L., A. Perez, and K.A. Dill, Determining protein structures by combining
semireliable data with atomistic physical models by Bayesian inference. Proc Natl Acad
Sci U S A, 2015. 112(22): p. 6985-90.
2. Perez, A., J.L. MacCallum, and K.A. Dill, Accelerating molecular simulations of proteins
using Bayesian inference on weak information. Proc Natl Acad Sci U S A, 2015. 112(38):
p. 11846-51.
3. Li, Q., et al., KScons: A Bayesian Approach for Protein Residue Contact Prediction using
the Knob-Socket Model of Protein Tertiary Structure. Bioinformatics, 2016.
4. Joo, H., et al., An amino acid packing code for α-helical structure and protein design. J
Mol Biol, 2012. 419(3-4): p. 234-254.
5. Joo, H. and J. Tsai, An amino acid code for β-sheet packing structure. Proteins, 2014.
82(9): p. 2128-2140.
6. Joo, H., et al., An amino acid code for irregular and mixed protein packing. Proteins,
2015. 83(12): p. 2147–2161.
7. Fraga, K.J., H. Joo, and J. Tsai, An amino acid code to define a protein's tertiary packing
surface. Proteins, 2016. 84(2): p. 201-16.
8. Drozdetskiy, A., et al., JPred4: a protein secondary structure prediction server. Nucleic
Acids Res, 2015. 43(W1): p. W389-94.
9. Edgar, R.C., MUSCLE: multiple sequence alignment with high accuracy and high
throughput. Nucleic Acids Res, 2004. 32(5): p. 1792-7.

MESHI_CON_SERVER

N-MESHI-Score – A decoy QA method

T. Sidi, C. Keasar
Ben-Gurion University of the Negev
chen.keasar@gmail.com

Quality Assessment (QA) of protein decoys is an essential component in any method for protein
structure prediction. It bridges between the tendency of prediction methods to generate numerous
models per target sequence, and the inability of biologists and chemists to use more than a few.
Despite much research in this field, selecting the best decoy out of a set of alternatives (decoy-set)
is still a hard task. Providing an absolute QA seems even harder, as only a few methods
address it.
The CASP12 QA server MESHI_CON_SERVER uses N-MESHI-Score, an extension of the
MESHI_Score method, which is used by MESHI_SERVER. This extension aims to cope with
two limitations of the individual-decoy-based MESHI_Score. First, MESHI_Score has relatively
poor discrimination among close-to-native decoys. Second, very similar decoys often receive
widely differing scores. To alleviate these problems, N-MESHI-Score associates each decoy
with a set of neighbors. It then uses the neighbors to tune both predictor training and final score
calculation.

Methods
The Methods section focuses on the few differences between N-MESHI-Score and
MESHI_Score, as the latter is presented in the MESHI_SERVER abstract. In a nutshell,
MESHI_Score first standardizes the decoys by SCWRL4 [7] rotamer optimization followed by energy
minimization. Then, the protocol extracts 106 structural features from each decoy and feeds them
to MESHI_Score, an ensemble of a thousand independent predictors. Each of these predictors is
trained to predict decoy qualities using a unique subset of the features. The final MESHI_Score
is the weighted median of the thousand individual scores. Below we present the modifications of
this scheme in N-MESHI-Score.

Decoy pre-processing – In addition to the MESHI_SERVER pre-processing of individual
decoys, each decoy in the set is compared to all the others and is associated with a group of
neighboring decoys, defined as those whose GDT_TS to it is ≥ 0.95.
Predictor training – The training of a single predictor is similar to the training in
MESHI_SERVER, but uses a different objective function. This function penalizes disagreement
between the predicted and actual quality (GDT_TS) of the decoy and rewards correlation
between the predicted and actual qualities of neighboring decoys.
Score – As in MESHI_SERVER, the score of MESHI_CON_SERVER is based on the weighted
median of the scores provided by a thousand independent predictors. Here, however, the final score
is biased towards the scores of the neighboring decoys to ensure their consistency.
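
The neighbor assignment in the pre-processing step can be sketched as follows. This is only an illustration: the function name and the representation of the pairwise comparison as a precomputed symmetric GDT_TS matrix are assumptions, and the actual MESHI implementation may organize this differently.

```python
def neighbor_groups(gdt, cutoff=0.95):
    """For each decoy i, list the other decoys j with GDT_TS(i, j) >= cutoff.

    gdt is assumed to be a square matrix (list of lists) of pairwise
    GDT_TS values in the 0-1 range, computed beforehand by some other tool.
    """
    n = len(gdt)
    return [
        [j for j in range(n) if j != i and gdt[i][j] >= cutoff]
        for i in range(n)
    ]


# Toy example with three decoys
pairwise = [
    [1.00, 0.96, 0.50],
    [0.96, 1.00, 0.97],
    [0.50, 0.97, 1.00],
]
print(neighbor_groups(pairwise))
```

Note that neighborhood defined this way is not transitive: in the toy example decoys 0 and 2 are both neighbors of decoy 1 but not of each other.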

Availability
The MESHI software package (version 9.29, which was used in CASP12) is available in:
https://www.dropbox.com/sh/mb31bjdvvydhuzh/AADVcclTZKtFiSl6I9hBx8Dxa?dl=0

1. Kryshtafovych,A. et al. (2014) Assessment of the assessment: Evaluation of the model
quality estimates in CASP10. Proteins, 82, 112–126.
2. Summa,C.M. and Levitt,M. (2007) Near-native structure refinement using in vacuo energy
minimization. PNAS, 104, 3177–3182.
3. Zhou,H. and Skolnick,J. (2011) GOAP: A Generalized Orientation-Dependent, All-Atom
Statistical Potential for Protein Structure Prediction. Biophysical Journal, 101, 2043–2052.
4. Samudrala,R. and Moult,J. (1998) An all-atom distance-dependent conditional probability
discriminatory function for protein structure prediction. Journal of Molecular Biology, 275,
895–916.
5. Amir,E.-A.D. et al. (2008) Differentiable, multi-dimensional, knowledge-based energy terms
for torsion angle probabilities and propensities. Proteins, 72, 62–73.
6. Levy-Moonshine,A. et al. (2009) Enhancement of beta-sheet assembly by cooperative
hydrogen bonds potential. Bioinformatics, 25, 2639–2645.
7. Krivov,G.G. et al. (2009) Improved prediction of protein side-chain conformations with
SCWRL4. Proteins, 77, 778–795.
8. Benkert,P. et al. (2011) Toward the estimation of the absolute quality of individual protein
structure models. Bioinformatics, 27, 343–350.
9. Mirzaei,S. et al. (2016) Purely Structural Protein Scoring Functions Using Support Vector
Machine and Ensemble Learning. IEEE, in press.
10. Petersen,T.N. et al. (2000) Prediction of protein secondary structure at 80% accuracy.
Proteins, 41, 17–20.

MESHI_SERVER

MESHI-Score – an individual decoy QA method

T. Sidi, C. Keasar
Ben-Gurion University of the Negev
chen.keasar@gmail.com

Quality Assessment (QA) of protein decoys is an essential component in any method for protein
structure prediction. It bridges between the tendency of prediction methods to generate numerous
decoys per target sequence, and the inability of biologists and chemists to use more than a few.
Despite much research in this field, selecting the best decoy out of a set of alternatives (decoy-set)
is still a hard task. Providing an absolute QA seems even harder, as only a few methods
address it.
Our QA server in CASP12 uses MESHI_Score, an ensemble learning method. It combines up
to a thousand independently trained predictors to provide an estimate of the absolute qualities
(GDT_TS) of individual decoys. To this end MESHI_Score uses a set of 106 features. This set
includes energy terms, as well as compatibility with sequence based solvent accessibility and
secondary structure prediction.

Methods
Training and validation data sets – The dataset consists of 86365 non-trivial (no short fragments
and no extended chains) decoys of 362 non-redundant CASP targets (CASP8-11; targets
with E-value < 1E-4 were discarded and targets with 1E-4 < E-value < 1E-2 were inspected
manually). A randomly chosen 10% of the targets were used as a validation set to ensure the quality of
the trained function.
Decoy pre-processing – Protein decoys often have server-specific structural peculiarities, which
are not related to their quality as measured by GDT_TS. MESHI_Score applies a pre-processing
step, which aims to standardize the decoys by reducing server-based biases. It includes
SCWRL4 [7] side-chain optimization followed by specially constrained energy minimization by
MESHI [11].
Features – At the heart of MESHI-Score is a large set of structural features that can be
calculated for each decoy independently of all others:
1. Strictly structural features that are measured directly from the decoy structures:
a. Pairwise potentials adopted from the literature: GOAP [3], Summa and Levitt KB-PMF [2],
ramp [4], and SCWRL4 energy.
b. In-house developed structural features
(https://www.cs.bgu.ac.il/~frankel/TechnicalReports/2015/15-06.pdf), which include (a)
meta-features that consider the distribution of the pairwise potential values within the
decoy; (b) standard bonded energy terms (e.g., quadratic bond term); (c) torsion angle
terms [5]; (d) hydrogen bond energy terms [6]; (e) solvation and atom environment terms; (f)
radius of gyration terms that quantify the compatibility of decoys with the observed,
length-dependent, radius of gyration ratios between different subsets of protein atoms
(e.g., polar and hydrophobic); (g) contact terms that quantify the compatibility of decoys
with the observed, length- and atom-type-dependent average number of contacts per atom.
2. Compatibility of the decoy secondary structure and solvent accessibility with their PSIPRED
prediction [10].
3. Combinations of the above features, which were developed in previous studies [9].
Predictor training – Each predictor is assigned a different sensitivity-region, from the
GDT_TS's zero-to-one range. The training process aims at reducing the per-target mean of
prediction error, with a higher weight on the sensitivity-region. To this end, a Monte-Carlo
optimization selects up to 15 features (out of 106) and combines them in a non-linear fashion.
Introducing feature selection into the optimization process results in an ensemble of independent
predictors, which use diverse sets of features.
Score – The score of a protein decoy is the weighted median of the scores provided by the
independent predictors. As the final score calculation does not use learned parameters, it allows
the use of a large number of informative features with minimal overfitting.
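
The final aggregation step can be illustrated with a minimal weighted-median sketch. The abstract does not state how the predictor weights are chosen, so the weights argument below is a placeholder, and the function is an assumption-laden illustration rather than the MESHI code.

```python
def weighted_median(scores, weights):
    """Return the score at which the cumulative weight first reaches half the total.

    scores and weights are parallel sequences, one entry per predictor.
    How the weights are derived is not specified in the text; any
    non-negative values can be supplied here.
    """
    if not scores or len(scores) != len(weights):
        raise ValueError("scores and weights must be non-empty and of equal length")
    pairs = sorted(zip(scores, weights))
    half = sum(w for _, w in pairs) / 2.0
    cumulative = 0.0
    for score, weight in pairs:
        cumulative += weight
        if cumulative >= half:
            return score
    return pairs[-1][0]  # only reached if floating-point rounding interferes


print(weighted_median([0.2, 0.5, 0.9], [1, 1, 1]))
print(weighted_median([0.2, 0.5, 0.9], [1, 1, 5]))
```

A median, unlike a fitted combination of the predictors, introduces no additional trained parameters at this stage, which is the property the text relies on to limit overfitting.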

Availability
The MESHI software package (version 9.29, which was used in CASP12) is available in:
https://www.dropbox.com/sh/mb31bjdvvydhuzh/AADVcclTZKtFiSl6I9hBx8Dxa?dl=0

1. Kryshtafovych,A. et al. (2014) Assessment of the assessment: Evaluation of the model
quality estimates in CASP10. Proteins, 82, 112–126.
2. Summa,C.M. and Levitt,M. (2007) Near-native structure refinement using in vacuo energy
minimization. PNAS, 104, 3177–3182.
3. Zhou,H. and Skolnick,J. (2011) GOAP: A Generalized Orientation-Dependent, All-Atom
Statistical Potential for Protein Structure Prediction. Biophysical Journal, 101, 2043–2052.
4. Samudrala,R. and Moult,J. (1998) An all-atom distance-dependent conditional probability
discriminatory function for protein structure prediction. Journal of Molecular Biology, 275,
895–916.
5. Amir,E.-A.D. et al. (2008) Differentiable, multi-dimensional, knowledge-based energy terms
for torsion angle probabilities and propensities. Proteins, 72, 62–73.
6. Levy-Moonshine,A. et al. (2009) Enhancement of beta-sheet assembly by cooperative
hydrogen bonds potential. Bioinformatics, 25, 2639–2645.
7. Krivov,G.G. et al. (2009) Improved prediction of protein side-chain conformations with
SCWRL4. Proteins, 77, 778–795.
8. Benkert,P. et al. (2011) Toward the estimation of the absolute quality of individual protein
structure models. Bioinformatics, 27, 343–350.
9. Mirzaei,S. et al. (2016) Purely Structural Protein Scoring Functions Using Support Vector
Machine and Ensemble Learning. IEEE/ACM Transactions on Computational Biology and
Bioinformatics, in press.
10. Petersen,T.N. et al. (2000) Prediction of protein secondary structure at 80% accuracy.
Proteins, 41, 17–20.
11. Kalisman N., et al. (2005) MESHI: a new library of Java classes for molecular modeling.
Bioinformatics 21:3931-3932.

MetaPSICOV

Consensus Coevolution-based Protein Contact Prediction using MetaPSICOV 2.0

D.T. Jones
Bioinformatics Group, Department of Computer Science, University College London, Gower Street, London, WC1E
6BT, United Kingdom
d.t.jones@ucl.ac.uk, URL: http://bioinf.cs.ucl.ac.uk

This method tests a novel, fully automated predictor of protein contacts based on coevolution
algorithms developed by the Bioinformatics Group at UCL.

Methods

Contact prediction algorithms based on inferring coevolution between
residue pairs in large multiple sequence alignments have attracted a great deal of attention in
recent years. Contact prediction accuracy is now reaching a point where it offers a viable means
for accurate 3-D modelling of proteins with no other information being required. MetaPSICOV
[1] is a neural network-based method which not only combines three distinct approaches
(PSICOV [2], DCA/FreeContact [3], CCMpred [4]) for inferring coevolutionary signals from
multiple sequence alignments, but also considers a broad range of other sequence-derived
features such as secondary structure, solvent accessibility and, uniquely, a range of metrics which
describe both the local and global quality of the input multiple sequence alignment, which was
generated using HHblits [5]. A second-stage predictor, which effectively
filters the output of the first stage, is also employed to further enhance the quality of contact
prediction. To facilitate easy comparison to other methods, MetaPSICOV was trained on a set of
624 protein structures with no detectable homology to the original PSICOV benchmark set.
The current version of MetaPSICOV (V2.0) makes use of a deeper neural network
architecture than the original, comprising two hidden layers of 160 units each and a larger
number of input features in the first stage network. The second stage network was more or less
unchanged, although both networks used the rectifier activation function rather than the sigmoid
function to improve training.
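
To make the architecture concrete, the sketch below runs a forward pass through a network with two hidden layers of 160 rectifier units. Everything beyond those two facts is an assumption for illustration: the number of input features (50 here), the random weight initialization, and the sigmoid output unit are not given in the abstract, and this is not the MetaPSICOV code.

```python
import math
import random


def relu(values):
    """Rectifier activation, applied element-wise."""
    return [max(0.0, v) for v in values]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def dense(x, weights, biases):
    """Fully connected layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]


def init_layer(n_in, n_out, rng):
    # Random untrained weights (assumption; real weights come from training)
    scale = math.sqrt(2.0 / n_in)
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def build_network(n_inputs, hidden=(160, 160), seed=0):
    rng = random.Random(seed)
    sizes = [n_inputs, *hidden, 1]
    return [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]


def predict(network, features):
    h = features
    for weights, biases in network[:-1]:
        h = relu(dense(h, weights, biases))   # hidden layers use the rectifier
    weights, biases = network[-1]
    return sigmoid(dense(h, weights, biases)[0])  # assumed sigmoid output in [0, 1]


net = build_network(n_inputs=50)   # 50 input features is a hypothetical value
print(predict(net, [0.1] * 50))
```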
The MetaPSICOV server is a prototype server which automates all of the MetaPSICOV
processing steps and was designed specifically for the CASP12 experiment. A publicly available
server with a more sophisticated user interface should be available by the end of 2016.

Results

A total of 89 predictions were submitted during the CASP12 experiment, and as these predictions
were generated automatically by a server, no human intervention was employed.

1. Jones, DT, Singh, T, Kosciolek, T, & Tetchner, S. (2015) MetaPSICOV: combining
coevolution methods for accurate prediction of contacts and long range hydrogen bonding in
proteins. Bioinformatics. 31, 999-1006.
2. Jones, D.T., Buchan, D.W., Cozzetto, D. & Pontil, M. (2012) PSICOV: Precise structural
contact prediction using sparse inverse covariance estimation on large multiple sequence
alignments. Bioinformatics. 28, 184-190.
3. Kaján, L., et al. (2014) FreeContact: fast and free software for protein contact prediction from
residue co-evolution, BMC bioinformatics, 15, 85.
4. Seemayer S., et al. (2014) CCMpred—fast and precise prediction of protein residue–residue
contacts from correlated mutations. Bioinformatics, 30, 3128–3130.
5. Remmert M, Biegert A, Hauser A, Söding J. HHblits: lightning-fast iterative protein sequence
searching by HMM-HMM alignment. Nat Methods. 2011 Dec 25;9(2):173-5.
doi:10.1038/nmeth.1818.

ModFOLD6, ModFOLD6_cor, ModFOLD6_rank

Automated 3D Model Quality Assessment using the ModFOLD6 Server

L.J. McGuffin and A.H.A. Maghrabi


School of Biological Sciences, University of Reading, Reading, UK
l.j.mcguffin@reading.ac.uk

The ModFOLD6 server is the latest version of our resource for the Quality Assessment (QA) of
3D models of proteins [1-3].

Methods
ModFOLD6 is our new approach to QA that combines a pure-single and quasi-single model
strategy for improving accuracy. Our emphasis was on increasing the accuracy of per-residue
assessments for single models. Each model was considered individually using 3 pure-single
model methods: ProQ2 [4] and 2 newly developed methods, the Contact Distance Agreement
(CDA) score and the Secondary Structure Agreement (SSA) score. Additionally, a set of 130
reference 3D models (generated using IntFOLD4, see our other abstract) was used to score
models using 3 alternative quasi-single model methods: the Disorder B-factor Agreement (DBA)
score, the ModFOLDclust_single_res score and the ModFOLDclustQ_single_res score. A neural
network (NN) was then used to combine the component per-residue/local quality scores from
each of the 6 alternative scoring methods, resulting in a final consensus of per-residue quality
scores for each model.

Component per-residue/local quality scoring methods:


1. CDA is a new pure-single model local QA method that relates to the agreement between
the predicted residue contacts according to MetaPSICOV [5] and the measured Euclidean distance
(in Å) between residues in the model. All pairs of residues in a model that were measured to be
8 Å apart or less were considered, and the CDA score for each residue was calculated as the mean
MetaPSICOV probability for that residue. Thus, CDA = sum(p)/c, where p is the probability of the
two residues being in contact according to MetaPSICOV and c is the number of contacts <= 8 Å
for the residue in the model, where a value for p exists.
2. SSA is a simple new pure-single model local QA method that relates to the agreement
between the predicted secondary structure of each residue according to PSIPRED [6] and the
secondary structure state of the residue in the model according to DSSP [7]. Thus, SSA = pCHE,
where pCHE is simply the probability from PSIPRED for the secondary structure state - coil (C),
helix (H) or strand (E) - of the residue in the model according to DSSP (6 states reduced to 3
states).
3. The local scores were also taken from the ProQ2 [4] method.
4. The ModFOLDclust_single_res local QA scores were calculated from the comparison
of each model with the reference set of 130 models built by IntFOLD4, in a similar way to the
ModFOLD4 server [3] acting in quasi-single model mode, with the predicted distances d converted
to similarities, thus: Sr = 1/(1+(d/3.9)^2).
5. The ModFOLDclustQ_single_res local QA scores were calculated in a similar way to
ModFOLDclust_single_res; however, in this case individual models were compared against the
reference IntFOLD4 set using the local Q-score approach [8].
6. DBA is a new quasi-single model QA method that relates to the agreement between the
predicted disordered residues in the sequence according to DISOPRED3 [9] and the
ModFOLDclust_single_res predicted per-residue error. Thus, DBA = 1-|Sr-(1-Pd)|, where Sr is the
ModFOLDclust_single_res accuracy of the predicted residue for the model and Pd is the
probability of disorder according to DISOPRED3.
The final ModFOLD6 per-residue similarity scores were calculated using a simple multi-layer
NN, which takes as its input a sliding window (size=5) of per-residue scores from each of the
6 methods described above, and outputs a single quality score for each residue in the model (30
inputs, 15 hidden, 1 output). The RSNNS package for R was used to construct the NN, which
was trained using data derived from the evaluation of CASP11 server models. (Similarity scores
were converted back to distances, d, by rearranging the equation for Sr above.)
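
The per-residue formulas above are simple enough to write out directly. The sketch below is an illustration only: the function names, the dictionary layout for the PSIPRED probabilities, and the choice to return 0 when a residue has no scored contacts are assumptions, not part of the ModFOLD6 server.

```python
import math


def cda(contact_probs):
    """CDA = sum(p)/c over the model contacts (<= 8 A) of one residue.

    contact_probs holds the MetaPSICOV probability for each such contact
    for which a prediction exists. Returning 0.0 for an empty list is an
    assumption; the text does not say how this case is handled.
    """
    c = len(contact_probs)
    return sum(contact_probs) / c if c else 0.0


def ssa(psipred_probs, dssp_state):
    """SSA = PSIPRED probability of the 3-state DSSP class ('C', 'H' or 'E')."""
    return psipred_probs[dssp_state]


def distance_to_similarity(d, d0=3.9):
    """Sr = 1 / (1 + (d / 3.9)^2)."""
    return 1.0 / (1.0 + (d / d0) ** 2)


def similarity_to_distance(s, d0=3.9):
    """Inverse of distance_to_similarity: d = 3.9 * sqrt(1/Sr - 1), for 0 < s <= 1."""
    return d0 * math.sqrt(1.0 / s - 1.0)


def dba(s_r, p_d):
    """DBA = 1 - |Sr - (1 - Pd)|."""
    return 1.0 - abs(s_r - (1.0 - p_d))


print(cda([0.9, 0.7]))
print(ssa({"C": 0.1, "H": 0.8, "E": 0.1}, "H"))
print(distance_to_similarity(3.9), similarity_to_distance(0.5))
print(dba(0.8, 0.2))
```

With these definitions a residue predicted to be ordered (low Pd) and modelled accurately (high Sr) scores a DBA near 1, as does a residue predicted to be disordered whose position is poorly reproduced.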

Global scoring methods:


Global scores were calculated by taking the mean per-residue scores (the sum of the per-residue
similarity scores divided by sequence lengths) for each of the 6 individual component
methods, described above, and the NN consensus output (ModFOLD6). Furthermore, 3
additional quasi-single global model quality scores were generated for each model based on the
original ModFOLDclust, ModFOLDclustQ and ModFOLDclust2 global scoring methods [10] (in a
similar vein to the ModFOLD4_single and ModFOLD5_single global scores, tested in CASP10
and CASP11 respectively). Thus, we ended up with 10 alternative global QA scores, which could
be combined in various ways in order to optimise for the different aspects of quality estimation.
We registered 3 ModFOLD6 global scoring variants:
The ModFOLD6 global score (the mean per-residue NN output score) considered alone
was found to have a good balance of performance based on correlations of predicted and
observed scores and rankings of the top models.
The ModFOLD6_cor global score variant (calculated as: (ModFOLDclustQ_single +
DBA_global + ModFOLD6_global)/3) was found to be an optimal combination for producing
good correlations with the observed scores, i.e. the predicted global quality scores produced
should produce closer to linear correlations with the observed global quality scores.
The ModFOLD6_rank global score variant (calculated as:
(ModFOLDclustQ_single_res_global + ProQ2_global + CDA_global + DBA_global +
SSA_global + ModFOLD6_global)/6) was found to be an optimal combination for ranking, i.e.
the top ranked models (top 1) should be closer to the highest accuracy, but the relationship
between predicted and observed scores may not be linear.
(Note: the local scores of the 3 ModFOLD6 variants were identical and so only the global
scores (and therefore the ranking of models) differed between the 3 versions.)
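
The three registered variants differ only in how the global scores are averaged, which the following sketch spells out. The function and argument names are chosen for this example; the formulas follow the text, but this is not the server's own code.

```python
def global_score(per_residue_scores, sequence_length):
    """Mean per-residue score: sum of similarity scores divided by sequence length."""
    return sum(per_residue_scores) / sequence_length


def modfold6_cor(clustq_single, dba_global, modfold6_global):
    """Variant tuned for correlation with observed scores: mean of 3 global scores."""
    return (clustq_single + dba_global + modfold6_global) / 3.0


def modfold6_rank(clustq_single_res_global, proq2_global, cda_global,
                  dba_global, ssa_global, modfold6_global):
    """Variant tuned for ranking the top model: mean of 6 global scores."""
    return (clustq_single_res_global + proq2_global + cda_global
            + dba_global + ssa_global + modfold6_global) / 6.0


# Dividing by the sequence length means unmodelled residues count as zero.
print(global_score([0.5, 1.0, 0.0], sequence_length=4))
print(modfold6_cor(0.6, 0.9, 0.3))
print(modfold6_rank(0.6, 0.5, 0.7, 0.9, 0.8, 0.7))
```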

Results
The ModFOLD6 server is continuously benchmarked using the CAMEO server [11] (identified as
servers 18 & 21). The method has been independently verified to be an improvement on our
previous leading ModFOLD4 & ModFOLD5 methods. At the time of writing ModFOLD6 ranks
among the top servers overall and outperforms all other public servers in the benchmark.

Availability
The ModFOLD6 server is available at:
http://www.reading.ac.uk/bioinf/ModFOLD/ModFOLD6_form.html

1. McGuffin,L.J. (2008) The ModFOLD Server for the Quality Assessment of Protein
Structural Models. Bioinformatics.24, 586-587.
2. McGuffin,L.J. (2009) Prediction of global and local model quality in CASP8 using the
ModFOLD server. Proteins. 77, 185-190.
3. McGuffin,L.J., Buenavista,M.T., Roche,D.B. (2013) The ModFOLD4 Server for the Quality
Assessment of 3D Protein Models. Nucleic Acids Res., 41, W368-72.
4. Uziela K, Wallner B. (2016). ProQ2: estimation of model accuracy implemented in Rosetta.
Bioinformatics. 32, 1411-3.
5. Jones,D.T., Singh,T., Kosciolek,T., Tetchner,S. (2015) MetaPSICOV: combining coevolution
methods for accurate prediction of contacts and long range hydrogen bonding in proteins.
Bioinformatics. 31, 999–1006.
6. Buchan, D. W. A., Minneci, F., Nugent, T. C. O., Bryson, K., & Jones, D. T. (2013). Scalable
web services for the PSIPRED Protein Analysis Workbench. Nucleic Acids Res. 41, W349–
W357.
7. Kabsch,W., Sander,C. (1983). Dictionary of protein secondary structure: Pattern recognition
of hydrogen-bonded and geometrical features. Biopolymers. 22, 2577–2637.
8. Ben-David,M., Noivirt-Brik,O., Paz,A., Prilusky,J., Sussman,J.L., Levy,Y. (2009)
Assessment of CASP8 structure predictions for template free targets. Proteins. 77 S9, 50-65.
9. Jones,D.T., Cozzetto,D. (2015) DISOPRED3: precise disordered region predictions with
annotated protein-binding activity. Bioinformatics. 31, 857–863.
10. McGuffin,L.J., Roche,D.B. (2010) Rapid model quality assessment for protein structure
predictions using the comparison of multiple models without structural alignments.
Bioinformatics. 26, 182-188.
11. Haas,J., Roth,S., Arnold,K., Kiefer,F., Schmidt,T., Bordoli,L. & Schwede,T. (2013) The
Protein Model Portal- a comprehensive resource for protein structure and model information.
Database (Oxford). bat031.

ModFOLDclust2

Automated 3D Model Quality Assessment using ModFOLDclust2

L.J. McGuffin1, D.B. Roche2,3


1 - School of Biological Sciences, University of Reading, Reading, UK, 2 - Institut de Biologie Computationnelle ,
LIRMM, CNRS-UMR 5506, Université de Montpellier, Montpellier, France, 3 - Centre de Recherche en Biologie
cellulaire de Montpellier, CNRS-UMR 5237, Montpellier, France
l.j.mcguffin@reading.ac.uk

The ModFOLDclust2 method [1] is a leading automatic clustering-based approach for both local
and global 3D model quality assessment [2].

Methods
The ModFOLDclust2 server tested during CASP12 was identical to that tested during the
CASP9, CASP10 & CASP11 experiments. The ModFOLDclust2 method was originally
developed to provide increased prediction accuracy, over the original ModFOLDclust
method [3,4], with minimal additional computational overhead. The global QA score from
ModFOLDclust2 is simply the mean of the global QA scores obtained from the
ModFOLDclustQ method and the original ModFOLDclust method. ModFOLDclustQ is similar
to our previous ModFOLDclust method; however, a modified version of the structural alignment
free Q-measure [5] is used instead of the TM-score [6] in order to carry out all-against-all pairwise
model comparisons. The per-residue QA scores for ModFOLDclust2 were just taken directly
from ModFOLDclust, as no advantage was gained from simply combining the per-residue scores
with those from ModFOLDclustQ.

Results
ModFOLDclust2 has been independently evaluated by the CASP assessors since CASP9 and has
consistently ranked among the top performing QA methods [2,7,8].

Availability
ModFOLDclust2 can be run as an option via the ModFOLD server (version 3.0):
http://www.reading.ac.uk/bioinf/ModFOLD/ModFOLD_form_3_0.html
The ModFOLDclust2 software is also available to download as a standalone program via:
http://www.reading.ac.uk/bioinf/downloads/

1. McGuffin,L.J., Roche,D.B. (2010) Rapid model quality assessment for protein structure
predictions using the comparison of multiple models without structural alignments.
Bioinformatics. 26, 182-188.
2. Kryshtafovych,A., Fidelis,K., Tramontano,A. (2011) Evaluation of model quality predictions
in CASP9. Proteins. 79 S10, 91-106.
3. McGuffin,L.J. (2007) Benchmarking consensus model quality assessment for protein fold
recognition. BMC Bioinformatics. 8, 345.
4. McGuffin,L.J. (2009) Prediction of global and local model quality in CASP8 using the
ModFOLD server. Proteins. 77 S9, 185-190.
5. Ben-David,M., Noivirt-Brik,O., Paz,A., Prilusky,J., Sussman,J.L. and Levy,Y. (2009)
Assessment of CASP8 structure predictions for template free targets, Proteins. 77 S9, 50-65.
6. Zhang,Y., Skolnick,J. (2004) Scoring function for automated assessment of protein structure
template quality. Proteins. 57, 702-710.
7. Kryshtafovych,A., Barbato,A., Fidelis,K., Monastyrskyy,B., Schwede,T., Tramontano,A.
(2013) Assessment of the assessment: evaluation of the model quality estimates in CASP10.
Proteins. 82 S2, 112-26.
8. Kryshtafovych,A., Barbato,A., Monastyrskyy,B., Fidelis,K., Schwede,T., Tramontano,A.
(2015) Methods of model accuracy estimation can help selecting the best models from decoy
sets: Assessment of model accuracy estimations in CASP11. Proteins. doi:
10.1002/prot.24919 [Epub ahead of print].

MUFOLD

Protein structure prediction and refinement using MUfold by sampling CASP decoys

Hongbo Li1,2,3, Junlin Wang3, Yi Shang3, and Dong Xu2,3


1-School of Computer Science and Information Technology, NorthEast Normal University, Changchun, 130117,
China, 2-Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, Missouri, 65211, USA, 3-
Department of Computer Science, University of Missouri, Columbia, Missouri, 65211, USA.

xudong@missouri.edu

We propose a method to predict and refine protein structural models by sampling decoys
predicted by CASP servers. We assemble similar models from CASP servers and use them as
templates to generate new models. We have used a QA strategy that combines a consensus
method and some single-model scores to select good candidates from the model pool iteratively.

Methods
For each target, the inputs are all the predicted models and the outputs are 5 good refined
models. Our method contains the following 5 steps:
1. Filtering redundant models from the initial pool.
2. Selecting good starting models.
3. Iteratively refining a single model.
4. Iteratively refining a combination of two models.
5. Selecting good candidates from the new models by a QA strategy.
Here are some details of each step:
Step 1. Some models in the pool may be very similar to each other or even identical. We
calculate the pairwise GDT-TS between the input models and filter out redundant models using a
GDT-TS cutoff of 0.99.
Step 2. We select 12 good starting models. We rank the models using a linear regression
model trained on the CASP8, 9 and 10 data and MUfold-QA, a composite of various QA methods
such as ProQ2 [1]. The top 5 models according to the linear regression method are
selected, and MUfold-QA is used to select the remaining 7 models from the pool.
Step 3. For each starting model sm, we find those models that are most similar to sm and
use them (including sm) as templates to build new models by MUfold and Modeller [2]. A scoring
function SC in MUfold-QA is defined to measure the quality of new models. According to SC, if
a new model nm is better than sm, then we use nm as the current sm and repeat the procedure. The
iteration ends when no new model is better than the current sm, and the candidate will be the one
with the highest SC score.
Step 4. For every pair of starting models sm1 and sm2, we find those models that are most
similar to them and use these models as templates to build new models by MUfold and Modeller.
At the end of each round, if a new model nm is better than either of the starting models sm1 or
sm2, then we use nm to replace it and repeat the procedure. The iteration ends when no new
model is better than sm1 or sm2. The candidate will be the one with the highest SC score.
Step 5. We use the SC score to sort all the candidates of Steps 3 and 4. The best 5 models
will be the final output.
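
The control flow of Steps 1 and 3 can be sketched as below. This is a hedged illustration: gdt_ts, build_models and score are hypothetical callables standing in for the structure comparison, the MUfold/Modeller model building and the SC function, and the greedy treatment of the 0.99 cutoff is an assumption about how redundancy is resolved.

```python
def filter_redundant(models, gdt_ts, cutoff=0.99):
    """Step 1: greedily keep a model only if it is not near-identical
    (GDT-TS >= cutoff) to a model that has already been kept."""
    kept = []
    for model in models:
        if all(gdt_ts(model, other) < cutoff for other in kept):
            kept.append(model)
    return kept


def refine_single(start, build_models, score):
    """Step 3: rebuild from the current model until no new model scores higher."""
    current = start
    while True:
        candidates = build_models(current)       # stand-in for MUfold + Modeller
        best = max(candidates, key=score, default=None)
        if best is None or score(best) <= score(current):
            return current                       # no improvement: stop iterating
        current = best


# Toy stand-ins: "models" are numbers, similarity is 1 - |difference|
toy_gdt = lambda a, b: 1.0 - abs(a - b)
print(filter_redundant([0.0, 0.005, 0.5], toy_gdt))
print(refine_single(0, lambda m: [m + 1] if m < 3 else [], lambda m: m))
```

Step 4 follows the same loop with a pair of starting models, replacing whichever member of the pair the new model improves on.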

Scoring function. The filtered input decoys are used as the reference set ref. Given a
predicted model $m_i$, the scoring function is defined as follows:
\[ SC(m_i) = 3\,\mathrm{norm}(\mathrm{cus}(m_i)) + \mathrm{norm}(\mathrm{dfire}(m_i)) + \mathrm{norm}(\mathrm{opusca}(m_i)) + \mathrm{norm}(\mathrm{cheng}(m_i)) \]
Here norm is a normalization to the range 0-1,
\[ \mathrm{norm}(s_i) = \frac{s_i - s_{min}}{s_{max} - s_{min}}, \]
where $s_{max}$ and $s_{min}$ are the maximum and minimum of score s over the models in ref, and
\[ \mathrm{cus}(m_i) = \frac{\sum_{m_j \in ref} \mathrm{GDT\text{-}TS}(m_i, m_j)}{|ref|}. \]
The scoring functions dfire, opusca and cheng are defined in [3, 4, 5], respectively.
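
A direct transcription of the scoring function is sketched below. The dictionary layout and function names are assumptions for this example, and the raw dfire, opusca and cheng values are taken as given inputs. The text does not say whether energy-like terms are sign-flipped before normalization, so the sketch applies the formula exactly as written.

```python
def norm(s, ref_scores):
    """Min-max normalization of s against the scores of the models in ref."""
    lo, hi = min(ref_scores), max(ref_scores)
    if hi == lo:
        return 0.0  # assumption: a constant term contributes nothing
    return (s - lo) / (hi - lo)


def cus(gdt_to_ref):
    """Consensus term: mean GDT-TS between a model and every model in ref."""
    return sum(gdt_to_ref) / len(gdt_to_ref)


def sc(model, ref):
    """SC = 3*norm(cus) + norm(dfire) + norm(opusca) + norm(cheng).

    model maps each term name to the model's raw value; ref maps each
    term name to the list of raw values over the reference set.
    """
    total = 3.0 * norm(model["cus"], ref["cus"])
    for term in ("dfire", "opusca", "cheng"):
        total += norm(model[term], ref[term])
    return total


ref = {"cus": [0.2, 0.6], "dfire": [0.0, 10.0], "opusca": [0.0, 4.0], "cheng": [0.0, 1.0]}
model = {"cus": 0.4, "dfire": 5.0, "opusca": 4.0, "cheng": 0.0}
print(sc(model, ref))
```

The factor of 3 gives the consensus term the same maximum contribution as the three single-model terms combined.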

Results
We have tested this method on the all-group targets of CASP11. Compared with the method we
used in CASP11, the average GDT-TS improved by approximately 0.03.

Reference
1. Ray A., Lindahl E. and Wallner B. (2012). Improved model quality assessment using ProQ2.
BMC Bioinformatics, 13, 1567-1587.
2. Sali A., Blundell T. L. (1993). Comparative protein modelling by satisfaction of spatial
restraints. J Mol Biol. 234, 779–815.
3. Zhou H., Zhou Y. (2002). Distance-scaled, finite ideal-gas reference state improves structure-
derived potentials of mean force for structure selection and stability prediction. Protein Sci.
11, 2714–2726.
4. Wu Y., Lu M., Chen M., Li J., Ma J. (2007). OPUSCa: a knowledge-based potential function
requiring only Ca positions. Protein Sci. 16, 1449–1463.
5. Wang Z., Tegge A., Cheng J. (2009). Evaluating the absolute quality of a single protein
model using structural features and support vector machines. Proteins: Struct Funct
Bioinformatics. 75, 638–647.

MUfoldQA_C

A 2-Stage Consensus Quality Assessment Method for Protein Structure Prediction

Wenbo Wang1, Dong Xu1,2, and Yi Shang1


1-Department of Computer Science, University of Missouri, Columbia, Missouri, 65211, USA, 2 - Christopher S.
Bond Life Sciences Center, University of Missouri, Columbia, Missouri, 65211, USA
wwr34@mail.missouri.edu

We propose a new 2-stage consensus algorithm to assess the quality of a predicted model for a
given target based on the structures of similar proteins (templates) and a pool of reference
models. Stage 1 evaluates the quality of each c-alpha position of the reference models based on
their similarity to the structures of proteins similar to the target protein found in the PDB
database. Stage 2 evaluates the quality of the given predicted model based on its similarity to the
reference models and the quality of the reference models.

Methods
The input is a target protein sequence (Q) of length n (n c-alpha atoms), a predicted model to be
scored (M), and r reference models ($R_i$, $i = 1, \dots, r$, which could be other predicted models). The
output is a quality score of the model in the range of 0 to 1, with 1 being the highest quality -
the same as its native structure. The method consists of the following 5 steps:
Step 1. Use Q to search in the PDB database with Blast1 and HHsearch2, respectively, to
find similar proteins as templates. Sort them based on scores calculated using the following
formula, then choose the top 10 of Blast templates and HHsearch templates separately to form a
set of 20 templates (T).
𝑆𝑜𝑟𝑡𝑆𝑐𝑜𝑟𝑒 = (3 − log10 𝐸) ∙ 𝐼 ∙ 𝐶
where E-value (E) and the percentage of identical sequences (I) are returned by Blast or
HHsearch and cover rate (C) refers to the ratio of the length of template sequence to the length of
target sequence.
Step 2. For each reference model 𝑅𝑖 , call subroutine Evaluate(Q, 𝑅𝑖 , T) to evaluate the
quality of 𝑅𝑖 and generate a weight array 𝐻𝑖𝑗 , 𝑗 = 1, 𝑛, consisting of a weight for each c-alpha
position in 𝑅𝑖 .
Step 3. Sort the reference models by the average of all elements in their weight arrays, and
choose up to 100 top reference models (𝑅𝑖 , 𝑖 = 1, …, 𝑣, with 𝑣 ≤ 100).
Step 4. Calculate GDT-TS between M and each top reference model 𝑅𝑖 . Let 𝐺𝑖 , 𝑖 = 1, 𝑣,
represent the GDT-TS value vector.
Step 5. For each c-alpha position in M, calculate the weighted average of 𝐺𝑖 using weight
𝐻𝑖𝑗 . Then, the final model score is the simple average of all c-alpha position scores:
$$\mathrm{Score} = \frac{1}{n} \sum_{j=1}^{n} \frac{\sum_{i=1}^{v} H_{ij} G_i}{\sum_{i=1}^{v} H_{ij}}$$
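
As a rough illustration of Step 5, the following Python sketch computes the final score from a weight matrix and a GDT-TS vector. The function name `two_stage_score`, the nested-list layout of `H`, and the choice to let a position whose weights sum to zero contribute 0 are assumptions made for this example, not details given in the abstract.

```python
def two_stage_score(H, G):
    """Average, over C-alpha positions, of the H-weighted mean of G.

    H[i][j]: weight of reference model i at position j (from Evaluate).
    G[i]:    GDT-TS between the model M and reference model i.
    """
    v = len(G)
    n = len(H[0])
    total = 0.0
    for j in range(n):
        denom = sum(H[i][j] for i in range(v))
        if denom == 0:
            # Assumption: a position with no weight contributes 0.
            continue
        total += sum(H[i][j] * G[i] for i in range(v)) / denom
    return total / n


# Toy input: two reference models, two positions.
example_score = two_stage_score([[0.8, 0.6], [0.4, 0.6]], [0.9, 0.5])
```

When all weights at a position are equal, the per-position value reduces to the plain mean of the GDT-TS values, which is the naïve consensus case.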

Subroutine Evaluate(Q, 𝑅𝑖 , T)

The subroutine evaluates the quality of a 3-D model 𝑅𝑖 of a protein sequence Q of length n based
on a set of 20 templates T, and generates a weight array 𝐻𝑖𝑗 , 𝑗 = 1, …, 𝑛, consisting of a weight
for each c-alpha position in 𝑅𝑖 . It has the following 3 major steps:
Step 1. Calculate GDT-TS value (𝑆𝑘 , 𝑘 = 1, 20) between the reference model and each of
the 20 templates in T.
Step 2. For each of the 20 templates, compare its c-alpha sequence with the c-alpha
sequence of Q. For each pair of c-alpha atoms at corresponding positions of the two
sequences, retrieve their BLOSUM³ value (B) and use the following formula to
calculate a heuristic weight, 𝑊𝑘𝑗 ,
𝑊𝑘𝑗 = 2𝐵
where 𝑘 ∈ 𝑚𝑗 . mj is the set of indices of the templates that have valid value at that c-alpha
position, and 𝑗 = 1, 𝑛.
Step 3. For each c-alpha position in 𝑅𝑖 , calculate the weighted average score (𝐻𝑖𝑗 ) using
all templates with valid value at that position.
$$H_{ij} = \frac{\sum_{k \in m_j} W_{kj} S_k}{\sum_{k \in m_j} W_{kj}}, \quad j \in [1, n]$$
where mj is the set of indices of the templates that have valid value at that c-alpha position.
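
The subroutine can be sketched in Python as below. This is a minimal sketch under stated assumptions: `evaluate_reference`, the use of `None` to mark a template with no valid residue at a position, and the fallback of 0 for positions covered by no template are all hypothetical. The extracted weight formula reads "W = 2B", and a superscript may have been lost in extraction, so the weight function is passed in rather than fixed; the usage line uses 2**B purely as an illustration because it keeps every weight positive.

```python
def evaluate_reference(S, B, weight):
    """Per-position weights H_j for one reference model.

    S[k]:    GDT-TS between the reference model and template k.
    B[k][j]: BLOSUM value of the target/template residue pair at position j,
             or None when template k has no valid residue there.
    weight:  function mapping a BLOSUM value to a heuristic weight.
    """
    n = len(B[0])
    H = []
    for j in range(n):
        num = 0.0
        den = 0.0
        for k, s_k in enumerate(S):
            b = B[k][j]
            if b is None:
                continue  # template k is not in m_j
            w = weight(b)
            num += w * s_k
            den += w
        # Assumption: positions covered by no template get weight 0.
        H.append(num / den if den > 0 else 0.0)
    return H


# Two templates, two positions; the first template does not cover position 2.
H_example = evaluate_reference(
    [0.8, 0.4], [[4, None], [4, 2]], weight=lambda b: 2.0 ** b
)
```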

Results
We tested this method on CASP11 targets for both CASP stage 1 QA task (83 targets each with
up to 20 models) and stage 2 QA task (81 targets each with up to 150 models). Its results of
Pearson correlation and top 1 GDT-TS loss are compared with those of the naïve consensus
method in the following table. For the stage 1 QA task, the new method outperforms naïve
consensus by 0.038 on Pearson correlation and 0.02 on top 1 GDT-TS loss. It is also better than
naïve consensus on the stage 2 QA task, by 0.021 on Pearson correlation and 0.007 on top 1
GDT-TS loss.

                     Stage 1 QA task              Stage 2 QA task
                     Pearson       Top 1          Pearson       Top 1
                     correlation   GDT-TS loss    correlation   GDT-TS loss
Naïve Consensus      0.8018        0.0523         0.5730        0.0736
2-Stage Consensus    0.8396        0.0336         0.5935        0.0667

1. Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W. & Lipman, D.J.
(1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search
programs. Nucleic Acids Res. 25, 3389-3402.
2. Söding, J. (2005). Protein homology detection by HMM-HMM comparison. Bioinformatics,
21, 951–960.
3. Henikoff, S., & Henikoff, J. G. (1992). Amino acid substitution matrices from protein blocks.
Proceedings of the National Academy of Sciences, 89, 10915–10919.

MUfoldQA_S

A Template-Based Quality Assessment Method for Protein Structure Prediction

Wenbo Wang1, Son Nguyen1, Dong Xu1,2, and Yi Shang1


1-Department of Computer Science, University of Missouri, Columbia, Missouri, 65211, USA, 2 - Christopher S.
Bond Life Sciences Center, University of Missouri, Columbia, Missouri, 65211, USA
wwr34@mail.missouri.edu

We propose a single-model QA method to assess the quality of a model of a target protein based
on the structures of proteins similar to the target protein. These similar proteins, called templates,
are found from the PDB database by using sequence search. We calculate the pairwise GDT-TS
between the input model and the templates. Then, for each c-alpha position of the model, a score
is calculated as the weighted average of the template GDT-TS values, weighted by a BLOSUM1-
based heuristic. Finally, the model score is the average of all c-alpha position scores.

Methods
The input is a target protein sequence and a predicted model, and the output is a quality score
for the model in the range from 0 to 1, where 1 is the highest quality, i.e., identical to the
native structure. The method consists of the following 4 major steps:
1. Search PDB database using Blast2 and HHsearch3 to find up to 20 similar proteins,
i.e., the templates.
2. Calculate GDT-TS value between the model and each template.
3. For each template, calculate a heuristic weight for each c-alpha position of the
template based on the BLOSUM value of the pair of c-alpha atoms at this position in
the template and target sequence.
4. Calculate the final model score as the average of all c-alpha position scores, which
are weighted GDT-TS values of all available templates for each position.

Here are some details of each step:


Step 1. Use the target protein sequence to search in the PDB database with Blast2 and
HHsearch3, respectively, to find similar proteins as templates. Sort them based on scores
calculated using the following formula, then choose the top 10 of Blast templates and HHsearch
templates, separately:
𝑆𝑜𝑟𝑡𝑆𝑐𝑜𝑟𝑒 = (3 − log10 𝐸) ∙ 𝐼 ∙ 𝐶
where E-value (E) and the percentage of identical sequences (I) are returned by Blast or
HHsearch and cover rate (C) is the ratio of the length of template sequence to the length of target
sequence.
Step 2. Calculate GDT-TS score (𝑆𝑖 , 𝑖 = 1, 20) between the input model and each of the
20 templates.
Step 3. For each of the 20 templates, compare its c-alpha sequence with the c-alpha
sequence of the target protein. For each pair of c-alpha atoms in the corresponding position of the
two sequences, retrieve the BLOSUM value of them (B) and use the following formula to
calculate a heuristic weight, 𝑊𝑖,𝑗 ,
𝑊𝑖,𝑗 = 2𝐵

where 𝑖 ∈ 𝑚𝑗 , 𝑚𝑗 is the set of indices of the templates that have valid value at that c-alpha
position, and 𝑗 = 1 𝑡𝑜 𝑛, n is the number of c-alpha atoms in the target protein.

Step 4. For each c-alpha position, calculate the weighted average score of GDT-TS of all
templates with valid value at that position. Then, the final model score is the simple average of
all c-alpha position scores.
$$\mathrm{Score} = \frac{1}{n} \sum_{j=1}^{n} \frac{\sum_{i \in m_j} W_{i,j} S_i}{\sum_{i \in m_j} W_{i,j}}$$
where mj is the set of indices of the templates that have valid value at that c-alpha position and n
is the number of c-alpha atoms in the target protein.
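
The template ranking of Step 1 can be sketched as follows. The names `sort_score` and `select_templates`, the dictionary keys, and the `min_e` floor (used only to keep the logarithm finite when a search tool reports an E-value of 0) are assumptions for this example. The abstract does not say whether identity is a fraction or a percentage, nor how a template found by both tools is handled; the sketch treats identity as a fraction and simply concatenates the two lists.

```python
import math


def sort_score(e_value, identity, coverage, min_e=1e-180):
    """SortScore = (3 - log10 E) * I * C for one hit."""
    e = max(e_value, min_e)  # assumption: clamp E so log10 stays finite
    return (3.0 - math.log10(e)) * identity * coverage


def select_templates(blast_hits, hhsearch_hits, top=10):
    """Keep the `top` best-scoring hits from each tool separately."""

    def best(hits):
        ranked = sorted(
            hits,
            key=lambda h: sort_score(h["E"], h["I"], h["C"]),
            reverse=True,
        )
        return ranked[:top]

    return best(blast_hits) + best(hhsearch_hits)


# Toy hits: same identity and coverage, different E-values.
hits = [
    {"id": "a", "E": 1e-3, "I": 0.5, "C": 1.0},
    {"id": "b", "E": 1e-10, "I": 0.5, "C": 1.0},
]
chosen = select_templates(hits, [], top=1)
```

Because the score grows as the E-value shrinks, a hit with E below 1000 always has a positive first factor, so ranking is then driven by the product of identity and coverage.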

Results
We tested this new method on CASP11 targets for both the CASP stage 1 QA task (83 targets each
with up to 20 models) and the stage 2 QA task (81 targets each with up to 150 models). Its
Pearson correlation and top 1 GDT-TS loss are compared with those of ProQ2⁴ in the
following table. For the stage 1 QA task, the new method outperforms ProQ2, one of the best
single-model QA methods, by 0.22 on Pearson correlation and 0.08 on top 1 GDT-TS loss. For the
stage 2 QA task, it is better than ProQ2 on Pearson correlation (0.4909 vs. 0.3647) but worse on
top 1 GDT-TS loss (0.0883 vs. 0.0563).

                     Stage 1 QA task              Stage 2 QA task
                     Pearson       Top 1          Pearson       Top 1
                     correlation   GDT-TS loss    correlation   GDT-TS loss
ProQ2                0.6002        0.1200         0.3647        0.0563
Template-Based QA    0.8198        0.0432         0.4909        0.0883

1. Henikoff, S., & Henikoff, J. G. (1992). Amino acid substitution matrices from protein blocks.
Proceedings of the National Academy of Sciences, 89, 10915–10919.
2. Altschul,S.F., Madden,T.L., Schaffer,A.A., Zhang,J., Zhang,Z., Miller,W. & Lipman,D.J.
(1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search
programs. Nucleic Acids Res. 25, 3389-3402.
3. Söding, J. (2005). Protein homology detection by HMM-HMM comparison. Bioinformatics,
21, 951–960.
4. Ray A., Lindahl E. and Wallner B. (2012). Improved model quality assessment using ProQ2.
BMC Bioinformatics, 13, 1567-1587.

MULTICOM

Tertiary Structure Prediction by the MULTICOM Human Group

Renzhi Cao1, Jie Hou2, Debswapna Bhattacharya3, Badri Adhikari2, and Jianlin Cheng2*
1 - Department of Computer Science, Pacific Lutheran University, WA 98447, USA , 2 - Department of Computer
Science, University of Missouri, Columbia, MO 65211, USA, 3 - Department of Electrical Engineering and
Computer Science, Wichita State University, Wichita, KS 67260, USA
*chengji@missouri.edu

A new human predictor, MULTICOM, was benchmarked in CASP12. In total, 14 different model
quality assessment (QA) methods were used in the model selection step. Raw-score-based and
Z-score-based consensus methods were used to combine the 14 model lists ranked by
these QA methods. In addition, some new single-model QA methods (i.e., Qprob¹ and DeepQA²)
were used as an important indicator for human inspection on human targets. Model combination
and the online refinement web server 3Drefine³ were used to improve the quality of the
predicted models.

Methods
MULTICOM first applied 14 QA methods (including both single-model and consensus QA methods) to
generate model quality scores for all CASP12 server models of each target. The QA scores were:
the Qprob global model quality score¹; the ModFOLDclust2 score⁴; the ProQ2 score⁵; the
Pcons score⁶; the APOLLO pairwise score⁷; the ModelCheck2 score, produced by an improved
version of ModelEvaluator⁸; the QApro score, a weighted combination of ModelEvaluator and
APOLLO⁹; a recalibrated SELECTpro energy score; the Dope score; the DFIRE2 score; the
OPUS_PSP score¹⁰; the RWplus score¹¹; the ModelEvaluator score¹²; and the RF_CB_SRS_OD score¹³.
After obtaining quality scores of the models, MULTICOM used two different
combinations of these scores (consensus based on raw scores and on Z-scores, respectively) to
generate consensus rankings of the models of each target. We then visually inspected the top
models selected by the consensus method and, with the help of the new single-model QA methods
(Qprob and DeepQA), adjusted the ranking of the top two models if needed. Next, MULTICOM used
MUFOLD_CL to cluster the models and selected three other models from different clusters
according to the consensus ranking. Finally, MULTICOM combined each selected model with other
similar models in the model pool using a model combination approach¹⁴ and applied the
3Drefine method to generate refined models for submission.
Results
We preliminarily evaluated the performance of MULTICOM human predictor along with
CASP12 server predictors on 11 CASP12 human targets whose structures were released by the
time of writing this abstract. The sum of the Z-scores of the first (i.e., TS1) models predicted
by these predictors for the 11 targets is reported in Table 1. The Z-score of a model was
calculated as the model's GDT-TS score minus the average GDT-TS score of all the models in the
model pool of a target, divided by the standard deviation of the GDT-TS scores. A negative
Z-score is converted to 0 during summation of the Z-scores for a predictor.
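
The ranking statistic described above can be sketched in Python as follows. The function names and the dictionary layout are hypothetical, and the abstract does not say whether the population or the sample standard deviation was used; the sketch assumes the population form (`statistics.pstdev`).

```python
from statistics import mean, pstdev


def z_score(gdt_ts, pool):
    """Z-score of one model's GDT-TS against all models in a target's pool."""
    sigma = pstdev(pool)  # assumption: population standard deviation
    if sigma == 0:
        return 0.0
    return (gdt_ts - mean(pool)) / sigma


def sum_of_z_scores(ts1_by_target, pool_by_target):
    """Sum a predictor's TS1 Z-scores over targets, counting negatives as 0."""
    total = 0.0
    for target, gdt_ts in ts1_by_target.items():
        total += max(0.0, z_score(gdt_ts, pool_by_target[target]))
    return total


# Toy data: one above-average and one below-average first model.
pools = {"T1": [0.2, 0.4, 0.6], "T2": [0.2, 0.4, 0.6]}
summed = sum_of_z_scores({"T1": 0.6, "T2": 0.2}, pools)
```

Clipping at 0 means a predictor is not penalized for a poor model on one target, so the sum rewards targets where the first model is clearly above the pool average.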
Table 1. The top 10 predictors ranked based on the summation of the Z scores. MULTICOM and
wfAll-Cheng (for details, see our CASP12 abstract entitled “Tertiary Structure Prediction by

wfAll-Cheng”) were human predictors, while all others were server predictors. The 11 targets are
T0859, T0862, T0863, T0864, T0868, T0869, T0870, T0872, T0900, T0904 and T0944.
Rank  Predictor name        Sum of Z-score   Rank  Predictor name      Sum of Z-score
1     wfAll-Cheng (Human)   17.05            6     QUARK               7.88
2     BAKER-ROSETTASERVER   16.54            7     MULTICOM-NOVEL      7.59
3     MULTICOM (Human)      16.30            8     MULTICOM-CLUSTER    6.94
4     GOAL                  11.05            9     RaptorX             6.15
5     Zhang-Server          8.95             10    RaptorX-Contact     5.87

1. Cao, R. & Cheng, J. Protein single-model quality assessment by feature-based probability
density functions. Sci Rep 6, 23990, doi:10.1038/srep23990 (2016).
2. Cao, R., Bhattacharya, D., Hou, J. & Cheng, J. DeepQA: Improving the estimation of single
protein model quality with deep belief networks. arXiv preprint arXiv:1607.04379 (2016).
3. Bhattacharya, D., Nowotny, J., Cao, R. & Cheng, J. 3Drefine: an interactive web server for
efficient protein structure refinement. Nucleic acids research, gkw336 (2016).
4. McGuffin, L. & Roche, D. Rapid model quality assessment for protein structure predictions
using the comparison of multiple models without structural alignments. Bioinformatics 26
(2010).
5. Ray, A., Lindahl, E. & Wallner, B. Improved model quality assessment using ProQ2. BMC
bioinformatics 13, 1 (2012).
6. Wallner, B. & Elofsson, A. Identification of correct regions in protein models using
structural, alignment, and consensus information. Protein Science 15, 900-913 (2006).
7. Wang, Z. APOLLO: a quality assessment service for single and multiple protein models.
Bioinformatics 27, doi:10.1093/bioinformatics/btr268 (2011).
8. Wang, Z. Evaluating the absolute quality of a single protein model using structural features
and support vector machines. Proteins 75, doi:10.1002/prot.22275 (2009).
9. Cao, R., Wang, Z. & Cheng, J. Designing and evaluating the MULTICOM protein local and
global model quality prediction methods in the CASP10 experiment. BMC structural biology
14, 1 (2014).
10. Lu, M., Dousis, A. D. & Ma, J. OPUS-PSP: an orientation-dependent statistical all-atom
potential derived from side-chain packing. Journal of molecular biology 376, 288-301
(2008).
11. Zhang, J. & Zhang, Y. A novel side-chain orientation dependent potential derived from
random-walk reference state for protein fold selection and structure prediction. PloS one 5,
e15386 (2010).
12. Wang, Z., Tegge, A. N. & Cheng, J. Evaluating the absolute quality of a single protein model
using structural features and support vector machines. Proteins: Structure, Function, and
Bioinformatics 75 (2009).
13. Rykunov, D. & Fiser, A. Effects of amino acid composition, finite size of proteins, and
sparse statistics on distance‐dependent statistical pair potentials. Proteins: Structure,
Function, and Bioinformatics 67, 559-568 (2007).
14. Wang, Z. MULTICOM: a multi-level combination approach to protein structure prediction
and its assessments in CASP8. Bioinformatics 26, doi:10.1093/bioinformatics/btq058 (2010).

MULTICOM (Ts)

SAXS-Assisted Tertiary Structure Prediction by MULTICOM

Jie Hou1, Badri Adhikari1, Renzhi Cao2, John J.Tanner3 and Jianlin Cheng1*
1 - Department of Computer Science, University of Missouri, Columbia, MO 65211, USA, 2 - Department of
Computer Science, Pacific Lutheran University, WA 98447, USA, 3 - Department of Biochemistry, University of
Missouri, Columbia, MO 65211, USA
* chengji@missouri.edu

MULTICOM aimed to select models that satisfy the experimental SAXS data best. Manual
quality inspection of SAXS data and automated SAXS-assisted model assessment were
performed in our system. MULTICOM server ranked models using a novel target function that
integrates multiple SAXS-related scalars, goodness-of-fitting to the experimental SAXS profile,
pair distance distribution, and predicted local/global structure quality. The selected models were
submitted to Ts category for human group during CASP12.

Methods
All the CASP12 server models of each target and its experimental SAXS data were first
collected. Preliminary analysis of the small-angle scattering data was conducted with PRIMUS¹,
including Guinier approximation of the radius of gyration (RG), estimation of the pairwise
distance distribution, and quality assessment (e.g., whether the sample was aggregated), which
provided the basis for further SAXS-assisted model evaluation.
The experimental SAXS profile was filtered to a maximal q value (momentum transfer)
of 0.3, and theoretical SAXS profiles of the CASP12 models were calculated by FoXS, a fast
method for computing the SAXS profile of a given structure and performing pairwise profile
fitting². The goodness-of-fit 𝜒 score between each computed profile and the experimental
SAXS profile was calculated and used to rank all of the models. The RG values of the CASP12
models and of the experimental profile were estimated from the pair distance function (PDF)
𝑃(𝑟) by GNOM³, where the 𝑃(𝑟) function was generated by an indirect Fourier transform of the
scattering profile. The absolute deviation of the RG value between each CASP12 model and the
experimental profile was calculated to rank the models. The difference between the PDF
distributions of a model and of the experiment was measured by the Kullback-Leibler
divergence⁴, which quantifies the information lost when a predicted model is used to
approximate the experimental profile. In addition to the SAXS-derived information, a
single-model quality assessment method, Qprob⁵, was applied to rank models based on the
local/global quality of structures (e.g., secondary structure, solvent accessibility, energy of
global structure).
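
To make the Kullback-Leibler term concrete, the sketch below compares two 𝑃(𝑟) histograms sampled on the same distance grid. The function name, the normalization of both inputs to sum to 1, the small `eps` added to avoid division by zero, and the direction of the divergence (experiment relative to model) are assumptions for this example; the abstract does not specify them.

```python
import math


def kl_divergence(p_exp, p_model, eps=1e-12):
    """D_KL(P_exp || P_model) for two P(r) histograms on the same r grid."""
    if len(p_exp) != len(p_model):
        raise ValueError("both distributions must use the same r grid")
    # Assumption: clip negatives and add eps so every bin is positive.
    p = [max(x, 0.0) + eps for x in p_exp]
    q = [max(x, 0.0) + eps for x in p_model]
    sp, sq = sum(p), sum(q)
    return sum(
        (pi / sp) * math.log((pi / sp) / (qi / sq)) for pi, qi in zip(p, q)
    )


# Toy two-bin distributions.
kl_example = kl_divergence([0.5, 0.5], [0.9, 0.1])
```

The divergence is zero when the two distributions coincide and grows as the model's distance distribution departs from the experimental one, so lower values indicate a better match.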
Finally, MULTICOM generated 4 lists of scores representing SAXS-fitting satisfaction
and conformation quality: the fitting 𝜒 score, the absolute deviation of the RG value, the
Kullback-Leibler divergence between PDF distributions, and the Qprob score. All four scores were
converted to Z-scores, and the sum of the Z-scores of each model was calculated to generate the
final consensus ranking. The top 5 models were refined by ModRefiner⁶, and local quality scores
from ModFOLDclustQ⁷ were added to the models before their submission to CASP.
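
The consensus step can be sketched as below. The abstract does not say how scores with opposite orientations were combined, so this sketch assumes that the three SAXS-derived scores are "lower is better" and have their Z-scores negated, while Qprob is "higher is better". The metric keys, the `LOWER_IS_BETTER` table, and the function name are hypothetical.

```python
from statistics import mean, pstdev

# Assumption: orientation of each score; not stated in the abstract.
LOWER_IS_BETTER = {"chi": True, "rg_dev": True, "kl": True, "qprob": False}


def consensus_ranking(scores):
    """Rank models by the sum of orientation-corrected Z-scores.

    scores: {metric name: {model name: raw value}}
    Returns model names, best first.
    """
    totals = {}
    for metric, by_model in scores.items():
        values = list(by_model.values())
        mu, sigma = mean(values), pstdev(values)
        sign = -1.0 if LOWER_IS_BETTER[metric] else 1.0
        for model, value in by_model.items():
            z = 0.0 if sigma == 0 else (value - mu) / sigma
            totals[model] = totals.get(model, 0.0) + sign * z
    return sorted(totals, key=totals.get, reverse=True)


# Toy pool of three models scored by all four measures.
ranking = consensus_ranking({
    "chi": {"A": 1.0, "B": 2.0, "C": 3.0},
    "rg_dev": {"A": 0.5, "B": 1.0, "C": 2.0},
    "kl": {"A": 0.1, "B": 0.2, "C": 0.4},
    "qprob": {"A": 0.7, "B": 0.5, "C": 0.3},
})
```

Converting to Z-scores first puts the four measures on a common scale, so no single score dominates the sum merely because of its units.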

1. Konarev, P. V., Volkov, V. V., Sokolova, A. V., Koch, M. H. & Svergun, D. I. PRIMUS: a
Windows PC-based system for small-angle scattering data analysis. Journal of applied
crystallography 36, 1277-1282 (2003).
2. Schneidman-Duhovny, D., Hammel, M. & Sali, A. FoXS: a web server for rapid computation
and fitting of SAXS profiles. Nucleic acids research 38, W540-W544 (2010).
3. Semenyuk, A. & Svergun, D. GNOM–a program package for small-angle scattering data
processing. Journal of Applied Crystallography 24, 537-540 (1991).
4. Kullback, S. Information theory and statistics. (Courier Corporation, 1997).
5. Cao, R. & Cheng, J. Protein single-model quality assessment by feature-based probability
density functions. Scientific reports 6 (2016).
6. Xu, D. & Zhang, Y. Improving the physical realism and structural accuracy of protein models
by a two-step atomic-level energy minimization. Biophysical journal 101, 2525-2534 (2011).
7. McGuffin, L. & Roche, D. Rapid model quality assessment for protein structure predictions
using the comparison of multiple models without structural alignments. Bioinformatics 26
(2010).

MULTICOM-CLUSTER (TS)

Protein Tertiary Structure Prediction by MULTICOM-CLUSTER Server

Jie Hou1, Jilong Li2, Renzhi Cao3, and Jianlin Cheng1*


1 - Department of Computer Science, University of Missouri, Columbia, MO 65211, USA, 2 - Omicsoft Corporation,
Cary NC, 27513 USA, 3 - Department of Computer Science, Pacific Lutheran University, WA 98447, USA
* chengji@missouri.edu

MULTICOM-CLUSTER server is an automated template-based and template-free approach for
protein tertiary structure prediction. In the CASP12 experiment, we generated an ensemble of
full-length models and domain-based models. Model evaluation and model selection were
applied to identify the most native-like structures.

Methods
The conformation recognition and ensemble system has been shown to be an effective method
for protein structure prediction¹. MULTICOM-CLUSTER incorporates (1) multi-aspect
template recognition techniques, in terms of sequence-sequence alignment, sequence-profile
alignment (e.g., PSI-BLAST², BLAST³, SAM⁴, HMMER⁵) and profile-profile alignment (e.g.,
HHsearch⁶, HHblits⁷, PRC⁸, FFAS⁹, COMPASS¹⁰); (2) algorithms for combining target-template
alignments, in terms of pairwise sequence similarity and pairwise structural similarity
between targets and templates¹¹; (3) a domain recognition, re-modeling, and assembly approach¹²;
(4) comparative modeling by Modeller¹³, used to generate structural models from multi-template
alignments; and (5) a template-free modeling tool¹⁴, applied to generate structures for
targets without reliable templates. For template-based modeling, about 150-200 multi-template
alignments/models were generated. For template-free modeling, dozens of additional structures
were added to the model pool. For domain-based modeling, the selected top conformations of
all domains were combined into full-length models. In total, around 150-250 models were
collected for quality assessment.
MULTICOM-CLUSTER ranked all models primarily by the APOLLO pairwise
similarity score, and the final top five models were selected by also considering the model
quality level, the coverage and identity of the template alignments, the e-value of the
alignments, and so on¹. All the domain models were ranked by APOLLO individually, and the top
domain models were combined into full-length models. To handle cases in which low-quality
models were accidentally ranked at the top, multi-level exception handling was implemented to
improve the quality of the top five models¹⁵. For example, a template-based top-one model with
low template coverage and a low pairwise model quality score is replaced with another
reasonable model. A domain-based top-one model is reverted to a full-length model if the quality
of the domain-assembled model is not satisfactory. Unaligned N-terminal or C-terminal regions in
the top models were filled in from other template-based models.

Results
We evaluated MULTICOM-CLUSTER on 20 CASP12 targets whose experimental structures
were released to date. In Table 1 we report the average GDT-TS scores and TM-scores of top 1
and best of top 5 models.

Table 1. The average GDT-TS scores and TM-scores of top one and best of five models on 20
CASP12 targets. These targets are T0859-T0865, T0868, T0869, T0870, T0872, T0879, T0891,
T0900, T0902, T0903, T0904, T0920, T0943, and T0944.
Predictor           Top One               Best of Five
                    GDT-TS    TM-score    GDT-TS    TM-score
MULTICOM-CLUSTER    0.49      0.55        0.51      0.57

1. Li, J., Deng, X., Eickholt, J. & Cheng, J. Designing and benchmarking the MULTICOM
protein structure prediction system. BMC Structural Biology 13, 1-14, doi:10.1186/1472-
6807-13-2 (2013).
2. Altschul, S. F. Gapped BLAST and PSI-BLAST: a new generation of protein database search
programs. Nucleic Acids Res 25, doi:10.1093/nar/25.17.3389 (1997).
3. Altschul, S. Basic local alignment search tool. J Mol Biol 215, doi:10.1016/s0022-
2836(05)80360-2 (1990).
4. Hughey, R. & Krogh, A. SAM: Sequence alignment and modeling software system. (1995).
5. Finn, R. D. HMMER web server: interactive sequence similarity searching. Nucleic Acids
Res 39, doi:10.1093/nar/gkr367 (2011).
6. Söding, J. Protein homology detection by HMM–HMM comparison. Bioinformatics 21, 951-
960 (2005).
7. Remmert, M., Biegert, A., Hauser, A. & Söding, J. HHblits: lightning-fast iterative protein
sequence searching by HMM-HMM alignment. Nature methods 9, 173-175 (2012).
8. Madera, M. PRC–The profile comparer, PhD thesis, (2006).
9. Jaroszewski, L., Rychlewski, L., Li, Z., Li, W. & Godzik, A. FFAS03: a server for profile–
profile sequence alignments. Nucleic acids research 33, W284-W288 (2005).
10. Sadreyev, R. & Grishin, N. COMPASS: a tool for comparison of multiple protein alignments
with assessment of statistical significance. Journal of molecular biology 326, 317-336
(2003).
11. Cheng, J. A multi-template combination algorithm for protein comparative modeling. BMC
Struct Biol 8, doi:10.1186/1472-6807-8-18 (2008).
12. Cheng, J., Eickholt, J., Wang, Z. & Deng, X. Recursive protein modeling: a divide and
conquer strategy for protein structure prediction and its case study in CASP9. Journal of
bioinformatics and computational biology 10, 1242003 (2012).
13. Sali, A. & Blundell, T. Comparative protein modelling by satisfaction of spatial restraints.
Protein structure by distance analysis 64, C86 (1994).
14. Leaver-Fay, A. ROSETTA3: an object-oriented software suite for the simulation and design
of macromolecules. Methods Enzymol 487, doi:10.1016/b978-0-12-381270-4.00019-6
(2011).
15. Li, J., Cao, R. & Cheng, J. A large-scale conformation sampling and evaluation server for
protein tertiary structure prediction and its assessment in CASP11. BMC Bioinformatics 16,
1-11, doi:10.1186/s12859-015-0775-x (2015).

MULTICOM-CLUSTER (QA)

Protein Model Quality Assessment by MULTICOM-CLUSTER Server

Renzhi Cao1, Jie Hou2, Debswapna Bhattacharya3, and Jianlin Cheng2*


1 - Department of Computer Science, Pacific Lutheran University, WA 98447, USA, 2 - Department of Computer
Science, University of Missouri, Columbia, MO 65211, USA, 3 - Department of Electrical Engineering and
Computer Science, Wichita State University, Wichita, KS 67260, USA
*chengji@missouri.edu

Our MULTICOM-CLUSTER server predicted global quality of stage1 and stage2 models of
CASP12 targets. MULTICOM-CLUSTER server used a new single-model quality assessment
(QA) method based on deep learning techniques.

Methods
MULTICOM-CLUSTER introduces a novel single-model QA method (DeepQA) based on a deep
belief network that uses 16 selected features describing the quality of a model from different
perspectives. The features include 9 top-performing energy and knowledge-based
potential scores: the ModelEvaluator score¹, Dope score², RWplus score³,
RF_CB_SRS_OD score⁴, Qprob score⁵, GOAP score⁶, OPUS score⁷, ProQ2 score⁸, and DFIRE2
score⁹. The remaining 7 input features are generated from the physicochemical properties of a
protein model. These features are calculated from a structural model and its protein sequence¹⁰,
and include: the secondary structure similarity (SS) score, solvent accessibility similarity (SA)
score, secondary structure penalty (SP) score, Euclidean compact (EC) score, surface (SU)
score, exposed mass (EM) score, and exposed surface (ES) score.
The deep belief network is trained on several large datasets consisting of models from
previous CASP experiments, several publicly available datasets, and models generated by our
in-house ab initio method. In our benchmark, it outperformed some well-established
methods in selecting good outlier models from a large set of mostly low-quality models
generated by ab initio modeling methods. The source code, executable, documentation, and
training/test datasets of DeepQA for Linux are freely available to non-commercial users
at http://cactus.rnet.missouri.edu/DeepQA/.

Results
We preliminarily evaluated the global quality assessment performance of MULTICOM-
CLUSTER on 20 CASP12 targets whose structures were released by the time of writing this
abstract. We used two metrics (i.e. average per target correlation and average per target loss) to
assess the quality scores predicted by our server against the real quality scores. The loss for each
target is calculated by the GDT-TS score difference between top 1 model ranked by predicted
scores and the overall best model in the model pool. The results are reported in Table 1.
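
The two metrics can be sketched for a single target as below; averaging the returned values over all targets gives the numbers reported in the table. The function names are hypothetical, and the sketch assumes the Pearson form of correlation, which this abstract does not state explicitly.

```python
import math


def pearson(x, y):
    """Pearson correlation between predicted and real quality scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # assumption: undefined correlation is reported as 0
    return cov / math.sqrt(vx * vy)


def top1_loss(predicted, true_gdt_ts):
    """GDT-TS of the best model in the pool minus that of the top-ranked model."""
    top = max(range(len(predicted)), key=predicted.__getitem__)
    return max(true_gdt_ts) - true_gdt_ts[top]


# Toy target with three models.
predicted_scores = [0.9, 0.5, 0.7]
real_gdt_ts = [0.6, 0.4, 0.8]
target_corr = pearson(predicted_scores, real_gdt_ts)
target_loss = top1_loss(predicted_scores, real_gdt_ts)
```

Correlation measures how well the predicted scores track quality across the whole pool, whereas loss only reflects the single model that would be selected, so a method can do well on one and poorly on the other.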

Table 1. The average per-target correlation score and average loss for global quality assessment
of stage1 and stage2 models. The 20 targets are: T0859-T0865, T0868, T0869, T0870, T0872,
T0879, T0891, T0900, T0902, T0903, T0904, T0920, T0943 and T0944.
Server name         Ave. Corr.   Ave. Corr.   Ave. Loss   Ave. Loss   Number of
                    Stage1       Stage2       Stage1      Stage2      Targets
MULTICOM-CLUSTER    0.71         0.56         0.06        0.09        20

1. Wang, Z., Tegge, A. N. & Cheng, J. Evaluating the absolute quality of a single protein model
using structural features and support vector machines. Proteins 75, 638-647,
doi:10.1002/prot.22275 (2009).
2. Shen, M. y. & Sali, A. Statistical potential for assessment and prediction of protein
structures. Protein science 15, 2507-2524 (2006).
3. Zhang, J. & Zhang, Y. A novel side-chain orientation dependent potential derived from
random-walk reference state for protein fold selection and structure prediction. PloS one 5,
e15386 (2010).
4. Rykunov, D. & Fiser, A. Effects of amino acid composition, finite size of proteins, and
sparse statistics on distance‐dependent statistical pair potentials. Proteins: Structure,
Function, and Bioinformatics 67, 559-568 (2007).
5. Cao, R. & Cheng, J. Protein single-model quality assessment by feature-based probability
density functions. Scientific Reports 6, 23990, doi:10.1038/srep23990 (2016).
6. Zhou, H. & Skolnick, J. GOAP: a generalized orientation-dependent, all-atom statistical
potential for protein structure prediction. Biophysical journal 101, 2043-2052 (2011).
7. Wu, Y., Lu, M., Chen, M., Li, J. & Ma, J. OPUS‐Ca: A knowledge‐based potential function
requiring only Cα positions. Protein Science 16, 1449-1463 (2007).
8. Uziela, K. & Wallner, B. ProQ2: Estimation of Model Accuracy Implemented in Rosetta.
Bioinformatics, btv767 (2016).
9. Yang, Y. & Zhou, Y. Specific interactions for ab initio folding of protein terminal regions
with secondary structures. Proteins: Structure, Function, and Bioinformatics 72, 793-803
(2008).
10. Mishra, A., Rao, S., Mittal, A. & Jayaram, B. Capturing native/native like structures with a
physico-chemical metric (pcSM) in protein folding. Biochimica et Biophysica Acta (BBA)-
Proteins and Proteomics 1834, 1520-1531 (2013).

MULTICOM-CONSTRUCT (TS)

Protein Tertiary Structure Prediction by MULTICOM-CONSTRUCT Server

Jie Hou1, Jilong Li2, Renzhi Cao3, and Jianlin Cheng1*


1 - Department of Computer Science, University of Missouri, Columbia, MO 65211, USA 2 - Omicsoft Corporation,
Cary NC, 27513 USA, 3 - Department of Computer Science, Pacific Lutheran University, WA 98447, USA
* chengji@missouri.edu

Our server MULTICOM-CONSTRUCT participated in the tertiary structure prediction (TS)
category of the CASP12 experiment. This method consisted of large-scale conformation
sampling and comprehensive model evaluation.

Methods
MULTICOM-CONSTRUCT aims to integrate large-scale conformation ensemble
protocols¹,² with diverse model quality assessment methods³,⁴ to improve protein structure
prediction.
In the first step, a variety of template recognition tools and template-free modeling tools
were used to generate an ensemble of models approximating the native structure of each target.
The process of generating the alignments and full-length models follows the same protocols as
MULTICOM-CLUSTER¹,².
In the second step, a two-level model quality consensus ranking was applied to
evaluate the ensemble of models. In total, 150-250 models were first evaluated by a
single-model quality assessment tool, ModelEvaluator⁵; a fully pairwise model QA tool, APOLLO⁶;
and a structure-based protein energy evaluation tool, SELECTpro⁷, and a consensus ranking of
the models was generated. The top 80 models from this consensus ranking were then fed into a
large-scale QA system consisting of 14 complementary quality assessment methods³, including
single-model QA methods (e.g., Qprob⁸, ProQ2⁹, ModelCheck2, which is an improved version of
ModelEvaluator⁵, SELECTpro⁷, Dope¹⁰, DFIRE2¹¹, OPUS_PSP¹², RWplus¹³, ModelEvaluator⁵,
RF_CB_SRS_OD¹⁴) and multi-model QA methods (e.g., ModFOLDclust2¹⁵, Pcons¹⁶, APOLLO⁶,
QApro¹⁷). The top 5 models were then selected from this new ranking and processed by our
multi-level exception-handling algorithm¹ to remove low-quality models.

Results
We evaluated MULTICOM-CONSTRUCT on 20 CASP12 targets whose experimental structures
were released to date. In Table 1 we report the average GDT-TS scores and TM-scores of top 1
and best of top 5 models.
Table 1. The average GDT-TS scores and TM-scores of top one and best of five models on 20
CASP12 targets. These targets are T0859-T0865, T0868, T0869, T0870, T0872, T0879, T0891,
T0900, T0902, T0903, T0904, T0920, T0943 and T0944.
                      Top One               Best of Five
Predictor             GDT-TS    TM-score    GDT-TS    TM-score
MULTICOM-CONSTRUCT    0.49      0.55        0.51      0.58

1. Li, J., Cao, R. & Cheng, J. A large-scale conformation sampling and evaluation server for
protein tertiary structure prediction and its assessment in CASP11. BMC Bioinformatics
16, 1-11, doi:10.1186/s12859-015-0775-x (2015).
2. Wang, Z. MULTICOM: a multi-level combination approach to protein structure
prediction and its assessments in CASP8. Bioinformatics 26,
doi:10.1093/bioinformatics/btq058 (2010).
3. Cao, R., Bhattacharya, D., Adhikari, B., Li, J. & Cheng, J. Large-scale model quality
assessment for improving protein tertiary structure prediction. Bioinformatics 31, i116-
i123 (2015).
4. Cao, R., Bhattacharya, D., Adhikari, B., Li, J. & Cheng, J. Massive integration of diverse
protein quality assessment methods to improve template based modeling in CASP11.
Proteins: Structure, Function, and Bioinformatics (2015).
5. Wang, Z. Evaluating the absolute quality of a single protein model using structural
features and support vector machines. Proteins 75, doi:10.1002/prot.22275 (2009).
6. Wang, Z. APOLLO: a quality assessment service for single and multiple protein models.
Bioinformatics 27, doi:10.1093/bioinformatics/btr268 (2011).
7. Randall, A. & Baldi, P. SELECTpro: effective protein model selection using a structure-
based energy function resistant to BLUNDERs. BMC Struct Biol 8, doi:10.1186/1472-
6807-8-52 (2008).
8. Cao, R. & Cheng, J. Protein single-model quality assessment by feature-based probability
density functions. Scientific reports 6 (2016).
9. Ray, A., Lindahl, E. & Wallner, B. Improved model quality assessment using ProQ2.
BMC bioinformatics 13, 1 (2012).
10. Shen, M. y. & Sali, A. Statistical potential for assessment and prediction of protein
structures. Protein science 15, 2507-2524 (2006).
11. Yang, Y. & Zhou, Y. Ab initio folding of terminal segments with secondary structures
reveals the fine difference between two closely related all‐atom statistical energy
functions. Protein science 17, 1212-1219 (2008).
12. Lu, M., Dousis, A. D. & Ma, J. OPUS-PSP: an orientation-dependent statistical all-atom
potential derived from side-chain packing. Journal of molecular biology 376, 288-301
(2008).
13. Zhang, J. & Zhang, Y. A novel side-chain orientation dependent potential derived from
random-walk reference state for protein fold selection and structure prediction. PloS one
5, e15386 (2010).
14. Rykunov, D. & Fiser, A. Effects of amino acid composition, finite size of proteins, and
sparse statistics on distance‐dependent statistical pair potentials. Proteins: Structure,
Function, and Bioinformatics 67, 559-568 (2007).
15. McGuffin, L. J. & Roche, D. B. Rapid model quality assessment for protein structure
predictions using the comparison of multiple models without structural alignments.
Bioinformatics 26, 182-188 (2010).
16. Wallner, B. & Elofsson, A. Identification of correct regions in protein models using
structural, alignment, and consensus information. Protein Science 15, 900-913 (2006).
17. Cao, R., Wang, Z. & Cheng, J. Designing and evaluating the MULTICOM protein local
and global model quality prediction methods in the CASP10 experiment. BMC structural
biology 14, 1 (2014).

MULTICOM-NOVEL (TS)

Improved integration of template-based and template-free model sampling methods for
Protein Structure Prediction

Jie Hou1, Jilong Li2, Badri Adhikari1, and Jianlin Cheng1*


1 - Department of Computer Science, University of Missouri, Columbia, MO 65211, USA, 2 - Omicsoft Corporation,
Cary NC, 27513 USA
*chengji@missouri.edu

In CASP12, we tested an improved integration of template-based and template-free model
sampling methods to improve tertiary structure prediction. Our MULTICOM-NOVEL server
used a variety of target-template alignment tools, model generation tools and a new strategy for
domain recognition and domain assembly. In addition, a large-scale model quality assessment
method was used for high-quality model selection.

Methods
We first searched the sequence/profile of the target against our in-house template database
and generated pairwise target-template alignments using a variety of template recognition
tools, including HHsearch1, COMPASS2, HMMER3, SAM4, PRC5, PSI-Blast6, CSBlast7, CSIBlast7,
MUSTER8, and RaptorX9.
We generated template-based models from the target-template alignments using a
probabilistic template-based model generation tool, MTMG10, and a comparative modelling tool,
Modeller11. For each target, all significant templates were used to detect any template-free
regions, and the target was split into template-based and template-free domains. The
template-free domains were modeled using the contact-guided protein folding tool CONFOLD12,
the full-length ab initio modelling tool ROSETTA13, and remote homology detection tools such
as MRFalign14, DNfold15, and RFfold16, which search the sequence against the SCOP domains. If
no significant templates were found, we ran ROSETTA and I-TASSER17 on the full-length target.
All the template-based, template-free, and domain-combined models were evaluated by
our in-house large-scale model quality consensus ranking tool18 to select high-quality protein
structural models. Specifically, the top-ranked domains were assembled iteratively to generate
full-length structures with minimum energy. After the top-ranked models were obtained,
exception handling was applied to those highly ranked template-based models that had long
missing-aligned regions: these regions were filled using the most significant templates that
covered the same regions. The top 5 models were refined by ModRefiner19 for submission.

Results
We evaluated MULTICOM-NOVEL on 20 CASP12 targets whose experimental structures have been
released to date. In Table 1 we report our preliminary evaluation, showing the average GDT-TS
scores and TM-scores of the top 1 and best of top 5 models.

Table 1. The average GDT-TS and TM-scores of top one and best of five models on 20 CASP12
targets predicted by MULTICOM-NOVEL. These targets include T0859-T0865, T0868, T0869,
T0870, T0872, T0879, T0891, T0900, T0902, T0903, T0904, T0920, T0943 and T0944.

                  Top One               Best of Five
Predictor         GDT-TS    TM-score    GDT-TS    TM-score
MULTICOM-NOVEL    0.49      0.56        0.52      0.59

1. Söding, J. Protein homology detection by HMM–HMM comparison. Bioinformatics 21, 951-960
(2005).
2. Sadreyev, R. & Grishin, N. COMPASS: a tool for comparison of multiple protein alignments
with assessment of statistical significance. Journal of molecular biology 326, 317-336
(2003).
3. Finn, R. D. HMMER web server: interactive sequence similarity searching. Nucleic Acids
Res 39, doi:10.1093/nar/gkr367 (2011).
4. Hughey, R. & Krogh, A. SAM: Sequence alignment and modeling software system. (1995).
5. Madera, M. PRC–The profile comparer, PhD thesis, (2006).
6. Altschul, S. F. Gapped BLAST and PSI-BLAST: a new generation of protein database search
programs. Nucleic Acids Res 25, doi:10.1093/nar/25.17.3389 (1997).
7. Biegert, A. & Söding, J. Sequence context-specific profiles for homology searching. Proc
Natl Acad Sci 106, doi:10.1073/pnas.0810767106 (2009).
8. Wu, S. & Zhang, Y. MUSTER: improving protein sequence profile–profile alignments by
using multiple sources of structure information. Proteins: Structure, Function, and
Bioinformatics 72, 547-556 (2008).
9. Peng, J. & Xu, J. RaptorX: exploiting structure information for protein alignment by
statistical inference. Proteins: Structure, Function, and Bioinformatics 79, 161-171 (2011).
10. Li, J. & Cheng, J. A Stochastic Point Cloud Sampling Method for Multi-Template Protein
Comparative Modeling. Scientific reports 6 (2016).
11. Sali, A. & Blundell, T. Comparative protein modelling by satisfaction of spatial restraints.
Protein structure by distance analysis 64, C86 (1994).
12. Adhikari, B., Bhattacharya, D., Cao, R. & Cheng, J. CONFOLD: residue‐residue contact‐
guided ab initio protein folding. Proteins: Structure, Function, and Bioinformatics 83, 1436-
1449 (2015).
13. Leaver-Fay, A. ROSETTA3: an object-oriented software suite for the simulation and design
of macromolecules. Methods Enzymol 487, doi:10.1016/b978-0-12-381270-4.00019-6
(2011).
14. Ma, J., Wang, S., Wang, Z. & Xu, J. MRFalign: protein homology detection through
alignment of Markov random fields. PLoS Comput Biol 10, e1003500 (2014).
15. Jo, T., Hou, J., Eickholt, J. & Cheng, J. Improving protein fold recognition by deep learning
networks. Scientific reports 5 (2015).
16. Jo, T. & Cheng, J. Improving protein fold recognition by random forest. BMC bioinformatics
15, S14 (2014).
17. Roy, A., Kucukural, A. & Zhang, Y. I-TASSER: a unified platform for automated protein
structure and function prediction. Nat Protoc 5 (2010).
18. Cao, R., Bhattacharya, D., Adhikari, B., Li, J. & Cheng, J. Large-scale model quality
assessment for improving protein tertiary structure prediction. Bioinformatics 31, i116-i123
(2015).
19. Xu, D. & Zhang, Y. Improving the physical realism and structural accuracy of protein
models by a two-step atomic-level energy minimization. Biophysical journal 101, 2525-2534
(2011).

MULTICOM-NOVEL, MULTICOM-CONSTRUCT, MULTICOM-CLUSTER (RR)

Machine Learning, Coevolution-Based and Hybrid Methods for Contact Prediction

B. Adhikari, J. Cheng
Department of Computer Science, University of Missouri, Columbia
chengji@missouri.edu

We participated in the contact prediction (RR) category using three methods. Contact predictions
by our sequence-based deep learning contact predictor DNcon1 were submitted as
MULTICOM-NOVEL. For coevolution-based contact prediction, we implemented an in-house method
for generating alignments, ran a coevolution-based method on them to predict contacts, and
submitted the results as MULTICOM-CONSTRUCT. Finally, we used a novel contact combination
approach to combine the two sets of predictions and submitted the result as
MULTICOM-CLUSTER.

Methods
Our MULTICOM-CONSTRUCT contact predictor relies on our new alignment generation
algorithm to predict contacts with MetaPSICOV2. For coevolution-based contact prediction tools
like MetaPSICOV, choosing the right alignment size is crucial for efficiency. We developed an
alignment generation method that produces at least some sequences whenever possible, even if
their quality is not high, and, on the other hand, avoids collecting too many sequences, for
faster execution, even when many homologous sequences are available. For alignment
generation, we run JackHMMER4 when the number of sequences in the alignment produced by
HHblits3 is less than 2.5L, L being the length of the input sequence. We observed that a range
of E-values is required for running JackHMMER because, for some input protein sequences, a
stringent E-value threshold like 1e-40 produces too few sequences (just a hundred or so), while
a much less stringent threshold of 1e-4 produces too many sequences (25K, 50K, etc.). Our
algorithm for generating alignments is summarized below:
T = 2.5
# alnsize() computes the ratio of the number of sequences in the alignment to L
coverage = 75, 68, 60
for each C in coverage:
    run HHblits (with -n 3 -maxfilt 500000 -diff inf -e 0.001 -id 99 -cov C) and get alignment hhbcov<C>
e-value = 40, 30, 20, 10, 4, 0
for each e in e-value:
    run JackHMMER (with -N 5 and -E e) and get alignment jhe<e>
# select the right alignment
alignment = hhbcov75, hhbcov60, jhe40, jhe30, jhe20, jhe10, jhe4, jhe0
for each aln in alignment (in that order):
    if alnsize(aln) > T:
        accept aln and quit
accept jhe0
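
The selection step at the end of this procedure can be sketched in Python as follows. This is only an illustration: it assumes the candidate alignments have already been generated and that their sequence counts are known, it does not run HHblits or JackHMMER, and the function and parameter names are hypothetical.

```python
def alnsize(num_sequences, target_length):
    """Ratio of the number of sequences in an alignment to the target length L."""
    return num_sequences / target_length


def select_alignment(candidates, target_length, threshold=2.5, fallback="jhe0"):
    """Return the first alignment, in the given order, whose alnsize exceeds T.

    candidates: ordered list of (alignment name, number of sequences), e.g.
    hhbcov75, hhbcov60, jhe40, jhe30, jhe20, jhe10, jhe4, jhe0.
    """
    for name, num_sequences in candidates:
        if alnsize(num_sequences, target_length) > threshold:
            return name
    # Nothing is deep enough: fall back to the least stringent JackHMMER run.
    return fallback
```

For a 100-residue target the threshold corresponds to 250 sequences, so a candidate list with 120, 200, 90 and 260 sequences (in that order) would select the fourth alignment.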

Our MULTICOM-CLUSTER method is a meta-predictor that combines the contacts
predicted by the two other methods, CONSTRUCT and NOVEL, whenever the alignment selected
for the CONSTRUCT method has a size of less than 50. Otherwise, the predicted contact
confidence values of the MULTICOM-CONSTRUCT predictions are updated and submitted as
MULTICOM-CLUSTER predictions. As the first step of contact combination, we select the top 5L
contacts from each of the two methods and replace their confidence values with integers,
running from 5L for the most confident contact prediction down to 1 for the least confident
one. Then, the confidence score of each of the top 5L contacts predicted by MULTICOM-CLUSTER
is calculated as the normalized mean of the confidence values from the two methods, i.e.
MCLij = (MCOij + MNOij)/5L, where MCOij is the confidence of contact pair (i,j) predicted by
MULTICOM-CONSTRUCT and MNOij is the confidence of contact pair (i,j) predicted by
MULTICOM-NOVEL. The final confidence values are normalized such that the top L predicted
contacts have confidence values greater than 0.5. Table 1 summarizes our preliminary
evaluation of our predictions on targets for which the best predicted model among the models
released by CASP has a TM-score below 0.5; we assume these to be potential free-modeling
targets. MULTICOM-CONSTRUCT and MULTICOM-CLUSTER perform similarly, and both perform better
than MULTICOM-NOVEL.
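
A minimal Python sketch of the rank-based combination follows. It implements the formula as written, (MCO + MNO)/5L, on integer rank scores. Two points are assumptions, since the text does not specify them: a pair missing from one method's top 5L list contributes 0 from that method, and the final rescaling that places the top L contacts above 0.5 is omitted. All function names are hypothetical.

```python
def rank_confidences(contacts, length):
    """Replace confidences of the top 5L contacts by integers 5L, 5L-1, ..., 1.

    contacts: list of ((i, j), confidence) from one method.
    """
    n = 5 * length
    top = sorted(contacts, key=lambda c: c[1], reverse=True)[:n]
    return {pair: n - k for k, (pair, _) in enumerate(top)}


def combine_contacts(mco, mno, length):
    """Combine two contact lists as (MCO_ij + MNO_ij) / 5L on rank scores."""
    n = 5 * length
    r_co = rank_confidences(mco, length)
    r_no = rank_confidences(mno, length)
    # Assumption: a pair absent from one list scores 0 for that method.
    combined = {
        pair: (r_co.get(pair, 0) + r_no.get(pair, 0)) / n
        for pair in set(r_co) | set(r_no)
    }
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)[:n]
```

Because the ranks rather than the raw confidences are averaged, the two predictors contribute on the same scale even if their confidence values are calibrated differently.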

Table 1. Precision of contacts predicted by MULTICOM-NOVEL (MNO), MULTICOM-CONSTRUCT
(MCO), and MULTICOM-CLUSTER (MCL) methods based on our preliminary
evaluations. N is the number of sequences in the multiple sequence alignment and Neff is the
effective number of sequences for the same alignment.

                               Top L/10 Contacts       Top L/5 Contacts
Target   L     N       Neff    MCL    MCO    MNO       MCL    MCO    MNO
T0859    133   2       1       0.0    0.0    0.0       0.0    0.0    0.0
T0862    239   163     23      33.3   33.3   33.3      27.8   27.8   27.8
T0863    670   453     44      1.7    1.7    5.2       1.7    1.7    4.3
T0864    246   526     134     68.2   68.2   40.9      65.9   65.9   36.4
T0869    120   17      12      60.0   60.0   60.0      52.4   42.9   47.6
T0870    138   137     69      25.0   25.0   50.0      16.7   16.7   37.5
T0904    341   23741   147     75.9   75.9   37.9      69.0   69.0   27.6
Avg      270   3577    61      37.7   37.7   32.5      33.3   32.0   25.9

Availability
DNcon is available for download at http://sysbio.rnet.missouri.edu/multicom_toolbox/tools.html.

1. Eickholt, J. and J. Cheng, Predicting protein residue–residue contacts using deep networks
and boosting. Bioinformatics, 2012. 28(23): p. 3066-3072.
2. Jones, D.T., et al., MetaPSICOV: Combining coevolution methods for accurate prediction of
contacts and long range hydrogen bonding in proteins. Bioinformatics, 2014: p. btu791.
3. Eddy, S.R., Profile hidden Markov models. Bioinformatics, 1998. 14(9): p. 755-63.
4. Johnson, L.S., S.R. Eddy, and E. Portugaly, Hidden Markov model speed heuristic and
iterative HMM search procedure. BMC Bioinformatics, 2010. 11: p. 431.

MULTICOM-REFINE

De novo protein modeling via stepwise, probabilistic synthesis and assembly of foldon units

Debswapna Bhattacharya1 and Jianlin Cheng2,3,*


1 - Department of Electrical Engineering and Computer Science, Wichita State University, Wichita, KS 67260-0083,
USA; 2 - Department of Computer Science, 3 - Informatics Institute, University of Missouri, Columbia, MO 65211,
USA
* chengji@missouri.edu

In CASP12, we test our recently developed generative, probabilistic model (UniCon3D) for de
novo protein structure prediction. UniCon3D simultaneously captures the local structural
preferences of the backbone and side-chain conformational space of polypeptide chains in a
united-residue representation, and performs experimentally motivated conditional
conformational sampling via stepwise synthesis and assembly of foldon units, minimizing a
composite physics- and knowledge-based energy function. Guided by residue-residue contacts
predicted using a combination of covariation techniques and machine learning methods, we
employ simulated annealing energy minimization to generate a large-scale decoy pool and submit
five models for each CASP12 target by clustering the ensemble of decoys.

Methods

For each CASP12 target, we first predict its eight-class secondary structure using SSpro 5
[1] and its residue-residue contacts using an in-house contact prediction pipeline based on
MetaPSICOV [2]. In order to strike a balance between accuracy and coverage of predicted
contacts, we select 15 different subsets of predicted contacts by taking the top contacts from
0.2L to 3L in steps of 0.2L, where L is the length of the target protein, and subsequently
execute 15 parallel threads of UniCon3D [3] simulations to generate up to 3,000 decoys within
36 hours. In UniCon3D [3], foldon units are sequentially synthesized and assembled via
conditional sampling from a novel united-residue Input-Output Hidden Markov Model (IOHMM),
which captures the local conformational bias of the backbone and side chain simultaneously in
a united-residue representation. Using a basic implementation of the UNRES physics-based
force field [4] aided by knowledge-based information derived from residue–residue contacts,
united-residue decoys are produced by minimizing the potential energy using simulated
annealing and identifying the minimum-energy conformation. Each united-residue decoy
generated by UniCon3D is then converted to the all-atom level using the PULCHRA software [5]
and ranked using our recently developed single-model quality assessment program Qprob [6],
which predicts a decoy’s quality by estimating the errors of structural, physicochemical and
energy-based features using probability density distributions. Next, we use the MUFOLD-CL [7]
method to cluster the decoy pool and select the top five decoys in different clusters based on
their Qprob rankings. The top five models are further refined using the ModRefiner software
[8] before submission.
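
The contact subsets described above can be enumerated with a short Python sketch. It only shows how the 15 subset sizes (0.2L, 0.4L, ..., 3L) follow from the target length and how a subset would be cut from a ranked contact list; the function names and the (i, j, confidence) tuple layout are assumptions, and rounding to whole contacts is not specified in the text.

```python
def contact_subset_sizes(length, n_subsets=15, step=0.2):
    """Sizes of the contact subsets: 0.2L, 0.4L, ..., 3L, rounded to integers."""
    return [round(k * step * length) for k in range(1, n_subsets + 1)]


def top_contacts(contacts, size):
    """Keep the `size` most confident contacts.

    contacts: list of (i, j, confidence) tuples.
    """
    return sorted(contacts, key=lambda c: c[2], reverse=True)[:size]


def contact_subsets(contacts, length):
    """One contact subset per simulation thread, from sparse to dense."""
    return [top_contacts(contacts, size) for size in contact_subset_sizes(length)]
```

Small subsets favor precision, since only the most confident contacts guide the simulation, while large subsets favor coverage; running one simulation per subset avoids committing to a single cutoff.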

Availability

Source code, executable versions, manuals, and example data of UniCon3D for Linux and OSX
are freely available to non-commercial users at http://sysbio.rnet.missouri.edu/UniCon3D/.

References

1. C. N. Magnan, and P. Baldi, “SSpro/ACCpro 5: almost perfect prediction of protein
secondary structure and relative solvent accessibility using profiles, machine learning and
structural similarity,” Bioinformatics, vol. 30, no. 18, pp. 2592-2597, 2014.
2. D. T. Jones, T. Singh, T. Kosciolek, and S. Tetchner, “MetaPSICOV: combining coevolution
methods for accurate prediction of contacts and long range hydrogen bonding in proteins,”
Bioinformatics, vol. 31, no. 7, pp. 999-1006, 2015.
3. D. Bhattacharya, R. Cao, and J. Cheng, “UniCon3D: de novo protein structure prediction
using united-residue conformational search via stepwise, probabilistic sampling,”
Bioinformatics, pp. btw316, 2016.
4. A. Liwo, R. Wawak, H. Scheraga, M. Pincus, and S. Rackovsky, “Prediction of protein
conformation on the basis of a search for compact structures: test on avian pancreatic
polypeptide,” Protein Science, vol. 2, no. 10, pp. 1715-1731, 1993.
5. P. Rotkiewicz, and J. Skolnick, “Fast procedure for reconstruction of full‐atom protein
models from reduced representations,” Journal of computational chemistry, vol. 29, no. 9,
pp. 1460-1465, 2008.
6. R. Cao, and J. Cheng, “Protein single-model quality assessment by feature-based probability
density functions,” Scientific Reports, vol. 6, 2016.
7. J. Zhang, and D. Xu, “Fast algorithm for population‐based protein structural model analysis,”
Proteomics, vol. 13, no. 2, pp. 221-229, 2013.
8. D. Xu, and Y. Zhang, “Improving the physical realism and structural accuracy of protein
models by a two-step atomic-level energy minimization,” Biophysical journal, vol. 101, no.
10, pp. 2525-2534, 2011.

myprotein-me, Skwark

Convolutional neural networks for protein contact prediction from coevolution alone.

M.J. Skwark1,2, V. Golkov3 and J. Meiler1,2


1 Department of Chemistry, Vanderbilt University, 2 Center for Structural Biology, Vanderbilt University, 3
Department of Computer Science, Technical University Munich
marcin@skwark.pl

The methods we proposed in CASP12 are entirely contact-driven and differ in the source of
the model pool. For myprotein-me, our automated method, the models came from local
comparative modelling, whereas Skwark, our human method, used models from both local
modelling and those submitted by CASP12 prediction servers.
The innovative step of myprotein-me lies in a novel method of predicting contacts,
plmConv. In contrast to currently established methods, which aggregate predictions from
multiple approaches1, 2, plmConv relies on a single multiple sequence alignment and no
additional, external information. By using a deep convolutional neural network, plmConv
achieves predictive performance exceeding that of CONSIP2/MetaPSICOV2 on CASP11 prediction
targets.

Methods
Computational methods3, 4, which attempt to infer evolutionary couplings between residue pairs
in multiple sequence alignment, have proven useful in the field of protein structure prediction, be
it for predicting structural contacts, or for guiding protein structure prediction. This class of
methods is thought to require large amounts of sequence information to achieve satisfactory
predictive performance5. For most of the protein families with sufficiently deep alignments, a
structure of a homologous protein is known, obviating the need for coupling inference in a
structural context. This need can be (partially) alleviated by introducing external information into the
prediction process, such as predicted secondary structure and solvent accessibility1, 2 or
geometric constraints and knowledge-based contact propensity potential6.
The method we have been testing in CASP, plmConv, relies on the evolutionary coupling
inference of a Potts model, using pseudolikelihood maximization (plmDCA3). In contrast to
other methods, which use post-processed results of inference, we use the inferred parameters of
the underlying Potts model as input to a deep, convolutional neural network. Predictions
submitted for CASP12 relied on jackhmmer7 hits to a UniRef100 database.
Structure prediction for CASP12 was achieved by constructing an alignment pool
based on threading results from HHpred8 and LOMETS9. For each of the alignments, we
built models with MODELLER10. These models were then refined with ModRefiner (Xu and
Zhang, 2011) and assessed with the method described below.
For model quality assessment in CASP12, we computed summary statistics on the
number of residue pairs that scored high in contact prediction and were satisfied by the
assessed models. We also took into account the mean, median, and maximum scores. To enable
score comparison between different targets, we trained a Random Forest on these features
together with the RW+ score, a knowledge-based potential, which allows discerning between
models when the contact prediction is not unequivocal. We want to stress that the MQAP
method used in CASP12 expressly does not use external structural information, be it from
templates or from comparison to other models.
The human prediction was based on the same premise, but augmented the model pool
with CASP12 server predictions.
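
The contact-based features described above could be computed along the following lines. This Python sketch is illustrative only: the score cutoff of 0.5, the 8 Å distance cutoff, the use of one representative atom per residue, and all names are assumptions not stated in the abstract, and the Random Forest that consumes the features (together with the RW+ score) is not shown.

```python
import math
from statistics import mean, median


def contact_features(predicted, coords, score_cutoff=0.5, dist_cutoff=8.0):
    """Summary statistics on how well a model satisfies predicted contacts.

    predicted: non-empty list of (i, j, score) contact predictions.
    coords: {residue index: (x, y, z)} for one representative atom per residue.
    """
    scores = [s for _, _, s in predicted]
    confident = [(i, j) for i, j, s in predicted if s >= score_cutoff]
    satisfied = [
        (i, j) for i, j in confident
        if math.dist(coords[i], coords[j]) <= dist_cutoff
    ]
    return {
        "n_confident": len(confident),
        "n_satisfied": len(satisfied),
        "fraction_satisfied": len(satisfied) / len(confident) if confident else 0.0,
        "mean_score": mean(scores),
        "median_score": median(scores),
        "max_score": max(scores),
    }
```

The features depend only on the predicted contact map and the model being assessed, which is consistent with the statement that no templates or other models are consulted.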
Results
Using the CASP11 predictions for 86 proteins as a benchmark, plmConv achieves an average
prediction precision (PPV) of 0.61 (95% confidence interval: 0.57-0.66). For the same
proteins, CONSIP2 has a PPV of 0.54 (95% conf. int.: 0.50-0.57), with plmConv outperforming
CONSIP2 by a δPPV of 0.08 (95% conf. int.: 0.05-0.12).
We also want to stress that, in contrast to CONSIP2, this evaluation was based on a
single alignment source, with no extra steps taken in the case of low-depth sequence
alignments. That said, plmConv performs at least on par with CONSIP2 in the low-sequence
regime and outperforms it for alignments with more sequences.

Figure 1. Rolling average of PPV (window size 15) as a function of the number of sequences in
the multiple sequence alignment.

1. Skwark MJ, Raimondi D, Michel M, Elofsson A. Improved contact predictions using the
recognition of protein like contact patterns. PLoS computational biology. 2014;10(11):e1003889.
2. Jones DT, Singh T, Kosciolek T, Tetchner S. MetaPSICOV: combining coevolution methods for
accurate prediction of contacts and long range hydrogen bonding in proteins. Bioinformatics.
2015;31(7):999-1006.
3. Feinauer C, Skwark MJ, Pagnani A, Aurell E. Improving contact prediction along three
dimensions. PLoS computational biology. 2014;10(10):e1003847.
4. Morcos F, Pagnani A, Lunt B, Bertolino A, Marks DS, Sander C, et al. Direct-coupling analysis
of residue coevolution captures native contacts across many protein families. Proceedings of the
National Academy of Sciences of the United States of America. 2011;108(49):E1293-301.
5. Kamisetty H, Ovchinnikov S, Baker D. Assessing the utility of coevolution-based residue-residue
contact predictions in a sequence- and structure-rich era. Proceedings of the National Academy
of Sciences of the United States of America. 2013;110(39):15674-9.
6. Wang Z, Xu J. Predicting protein contact map using evolutionary and physical constraints by
integer programming. Bioinformatics. 2013;29(13):i266-i73.
7. Eddy SR, editor A new generation of homology search tools based on probabilistic inference.
Genome Inform; 2009.
8. Hildebrand A, Remmert M, Biegert A, Söding J. Fast and accurate automatic structure prediction
with HHpred. Proteins: Structure, Function, and Bioinformatics. 2009;77(S9):128-32.
9. Wu S, Zhang Y. LOMETS: a local meta-threading-server for protein structure prediction.
Nucleic acids research. 2007;35(10):3375-82.
10. Webb B, Sali A. Comparative Protein Structure Modeling Using MODELLER. Current protocols
in bioinformatics / editorial board, Andreas D Baxevanis [et al]. 2014;47:5.6.1-5.6.32.

Naïve

Protein Contact Prediction with Deep Fully Convolutional Neural Network

Yang Liu, Qing Ye, Jian Peng


University of Illinois, Urbana Champaign
jianpeng@illinois.edu

See iFold_1, iFold_2, Deepfold-Contact, Deepfold-Boom, naïve.

Pcomb-domain

Domain-level model quality assessment to improve overall model quality

C. Mirabello, R. Pilstål, and B. Wallner


Linköping University, S-581 83 Linköping, Sweden
bjornw@ifm.liu.se

Traditionally, our consensus MQAP Pcons1 has always used rigid-body superposition of the
full-length models, thereby selecting the models that overall have the highest consensus with
everything else. This ignores the fact that smaller domains from other models could actually
have a better consensus over that particular domain. To overcome this problem, we developed a
domain-based version of Pcons that, based on an initial domain definition, runs Pcons for each
domain separately. The domain-based Pcons scores were combined with the local predicted
scores from our single-model MQAP ProQ22,3. Two different methods were used to predict the
domain boundaries of the target sequences: the first used the domain definitions from the
Robetta4 server (see the BAKER-ROSETTASERVER abstract for details); the second was based on
spectral analysis of the top-ranking server models according to the regular Pcomb method (see
the Wallner method abstract for details). The results from these two methods were manually
evaluated to decide the final domain boundaries. The outputs of Pcons and ProQ2 were weighted
slightly differently than in the regular Pcomb method: following a parameter optimization
based on targets released in the last two editions of CASP, the relative weight for ProQ2 was
increased, giving Pcomb-domain=0.3*ProQ2-domain+0.7*Pcons-domain, compared to
Pcomb=0.2*ProQ2+0.8*Pcons. Since the ProQ2 scores are local, there was no need to rerun ProQ2
for each domain; instead we could simply sum up the local scores for a particular domain.
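
The weighting can be written out as a small Python sketch. The weights 0.3/0.7 and 0.2/0.8 are taken from the text. Dividing the summed local ProQ2 scores by the domain length, so that the result is on the same 0-1 scale as a Pcons score, is an assumption, as are the function names and the 1-based inclusive domain boundaries.

```python
def proq2_domain(local_scores, start, end):
    """Domain-level ProQ2 score from per-residue scores of the full-length model.

    start and end are 1-based, inclusive domain boundaries. The summed local
    scores are divided by the domain length (assumed normalization).
    """
    segment = local_scores[start - 1:end]
    return sum(segment) / len(segment)


def pcomb(proq2_score, pcons_score, w_proq2=0.2):
    """Regular Pcomb: 0.2*ProQ2 + 0.8*Pcons."""
    return w_proq2 * proq2_score + (1.0 - w_proq2) * pcons_score


def pcomb_domain(proq2_dom, pcons_dom):
    """Pcomb-domain: 0.3*ProQ2-domain + 0.7*Pcons-domain."""
    return pcomb(proq2_dom, pcons_dom, w_proq2=0.3)
```

Because ProQ2 is evaluated once per model and only sliced per domain, changing the domain definition requires rerunning Pcons but not ProQ2.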

Methods
The Pcomb-domain group participated in the TS and QA categories in CASP12. In the TS
category, we used the Pcomb-domain score to assess the local and global quality of all server
models in order to select the best modelled domains. These were then assembled by inserting
fragments in the linker regions, guided by inter-domain contact predictions from PconsC3
(http://pconsc3.bioinfo.se), using the remodel protocol in Rosetta.
For the QA category, the quality assessment is based on the Pcomb-domain score only, as
described above.

Availability
The current version of ProQ2 is available as a scoring function in the Rosetta modeling
suite from http://www.rosettacommons.org. Additional scripts to generate all necessary input
files, and command lines showing how to run ProQ2-refine, can be found here:
http://github.com/bjornwallner/ProQ_scripts/. Pcomb-domain is available upon request.

1. Wallner, B. & Elofsson, A. Pcons5: combining consensus, structural evaluation and fold
recognition scores. Bioinformatics 21, (2005).
2. Ray, A., Lindahl, E. & Wallner, B. Improved model quality assessment using ProQ2. BMC
Bioinformatics 13, 224 (2012).

3. Uziela, K. & Wallner, B. ProQ2: estimation of model accuracy implemented in Rosetta.
Bioinformatics 32(9):1411-3 (2016).
4. Kim,D.E., Chivian,D., & Baker,D. (2004). Protein structure prediction and analysis using
the Robetta server. Nucleic Acids Res 32, W526-W531.

Pcons, Pcons-net

Improved model quality assessments using Rosetta energy terms and deep learning
Karolis Uziela1, Nanjiang Shu1, David Menéndez Hurtado1, Björn Wallner2 and Arne Elofsson1
1 - Science for Life Laboratory and Department of Biochemistry and Biophysics, Stockholm University, Stockholm
10691, Sweden; 2 - Department of Physics, Chemistry and Biology, Linköping University, 581 83 Linköping, Sweden.
arne@bioinfo.se

See ProQ3, ProQ3_1, ProQ3_1_diso, ProQ3_2_diso, RSA_SS_CONS, Pcons, Pcons-net.

PconsC2, PconsC3, PconsC31

Improved Contact prediction for small families using PconsC3 combining DCA and non-
DCA methods

Mirco Michel, Nanjiang Shu, David Menéndez Hurtado and Arne Elofsson
Science for Life Laboratory and Department of Biochemistry and Biophysics, Stockholm University,
Stockholm 10691, Sweden
arne@bioinfo.se

Here we participated in the RR category. We applied PconsC21 as published to serve as a
baseline benchmark for our current methods PconsC32 and PconsC31. In contrast to PconsC2,
PconsC3 includes a non-DCA contact prediction method in its input features and uses only a
single alignment as input. This results in much higher prediction quality on small families in our
benchmarks. In PconsC31 we tested whether we are able to select an optimal alignment based on
evaluation of the contact map, thus increasing the performance of our contact predictor compared
to the default choice.

Methods

PconsC2 predicts amino acid contacts based on an ensemble of 8 different alignments and 16
different input contact maps, as described in ref. 1. It uses HHblits3 and Jackhmmer4 with
varying E-value thresholds (1, 10^-4, 10^-10, 10^-40) for the alignments and PSICOV5 and
PlmDCA6 as input contact predictions. PconsC3 used HHblits with an E-value of 1 to generate the
alignment and PlmDCA, GaussDCA7, and PhyCMAP8 for input contact predictions. PconsC31 is
essentially PconsC3 run on all PconsC2 input alignments, followed by selection of the alignment
presumed to perform best. Our selection was based on the number of sequences in the
alignment and whether it covered most or only a small part of the target sequence.

All three methods use pattern recognition based on multiple layers of random forests. The
first layer takes the input contact maps as well as features on secondary structure, solvent
accessibility, and on the alignment as input. Every following layer takes the same input plus
additional features from the contact map predicted in the previous layer (intermediate contact
map). PconsC2 saturates in performance at layer 5 whereas PconsC3 and PconsC31 do so at
layer 3.

Figure 1: PPV for all three methods on 21 targets that could be identified as of writing this abstract. T0865
consists of a single helix with no contacts above 5 residues in sequence separation.

Results
Figure 1 shows positive predictive values (PPV) for 21 targets that could be identified in PDB as
of writing this abstract. It shows that PconsC3 (average PPV 0.44) outperforms PconsC2 (0.35).
With PconsC31 we were able to identify an alignment that led to slightly better results in target
T0891. In target T0920 a Jackhmmer alignment was selected instead of the default, because it
had more sequences and a more balanced coverage. However, it turned out to be worse than the
default HHblits alignment. On average PconsC31 achieves a PPV of 0.44 and is thus so far not
significantly different in performance than PconsC3.
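
The PPV values quoted above can be illustrated with a minimal sketch. The function name, the `top_n` parameter and the minimum sequence separation of 6 (one reading of "more than 5 residues") are assumptions made for illustration; the abstract does not state how many top-ranked contacts were evaluated.

```python
def contact_ppv(predicted, native, top_n, min_separation=6):
    """Fraction of the top_n predicted contacts that are native contacts.

    predicted: iterable of (i, j, score) tuples.
    native: set of (i, j) residue pairs with i < j.
    """
    # Keep only pairs separated by at least min_separation residues in sequence.
    candidates = [(min(i, j), max(i, j), s) for i, j, s in predicted
                  if abs(i - j) >= min_separation]
    # Rank by descending score and keep the top_n pairs.
    ranked = sorted(candidates, key=lambda c: c[2], reverse=True)[:top_n]
    if not ranked:
        return None  # nothing to evaluate, e.g. a target that is a single helix
    true_positives = sum((i, j) in native for i, j, _ in ranked)
    return true_positives / len(ranked)
```

Returning `None` when no pair passes the separation filter mirrors the situation described for T0865, which could not be evaluated.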

Target T0865 could not be evaluated because it has no native contacts with more than 5 residues
separation in sequence. Its structure consists of a single long helix. Targets T0859 and T0929 are
identical and have only themselves in the alignment; no other homologues could be found. All three
targets were thus excluded from the analysis in Figure 2.

Figure 2: Alignment size measured in effective sequences against contact map precision (PPV). Lines
show a running average for each method with a window size of 5.

Figure 2 shows PPV of the predicted contacts against the size of the default HHblits alignment.
Alignment size is measured in effective sequences, defined as in [6]. The running average
illustrates improved performance of PconsC3 and PconsC31 over PconsC2, especially in small
alignments with around 100 effective sequences.
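
As a rough illustration of what "effective sequences" measures, the sketch below down-weights each aligned sequence by the number of near-identical sequences in the alignment. The 0.9 identity threshold and the treatment of gaps as ordinary characters are assumptions for illustration only; the exact definition used here is the one in reference 6.

```python
def effective_sequences(alignment, identity_threshold=0.9):
    """Sum of weights 1/m_a, where m_a counts the aligned sequences
    (including a itself) that are at least identity_threshold identical to a."""
    def identity(a, b):
        # Fraction of alignment columns in which the two sequences agree.
        return sum(x == y for x, y in zip(a, b)) / len(a)

    meff = 0.0
    for a in alignment:
        neighbours = sum(identity(a, b) >= identity_threshold for b in alignment)
        meff += 1.0 / neighbours
    return meff
```

Two identical sequences therefore count as one effective sequence, so a deep but redundant alignment can still have a small effective size.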

Availability
PconsC3 is available as a webserver and downloadable version at http://c3.pcons.net/.

1. Skwark,M. J., Raimondi,D., Michel,M., Elofsson,A. (2014) Improved Contact Predictions
Using the Recognition of Protein Like Contact Patterns. PloS Computational Biology, 10(11)
2. Skwark,M.J., Michel,M., Hurtado,D.M., Ekeberg,M., Elofsson,A. (2016) Accurate contact
predictions for thousands of protein families using PconsC3. Submitted
3. Remmert,M., Biegert,A., Hauser,A., Söding,J. (2012) HHblits: lightning-fast iterative protein
sequence searching by HMM-HMM alignment. Nature Methods 9, 173-175
4. Finn,R.D., Clements,J., Eddy,S.R. (2011). HMMER web server: interactive sequence
similarity searching. Nucleic Acids Research 39 (suppl 2): W29-W37
5. Jones,D.T., Buchan,D.W.A., Cozzetto,D., Pontil,M. (2012). PSICOV: precise structural
contact prediction using sparse inverse covariance estimation on large multiple sequence
alignments Bioinformatics 28 (2): 184-190.
6. Ekeberg,M., Lövkvist,C., Lan,Y., Weigt,M., and Aurell,E. (2013) Improved contact
prediction in proteins: Using pseudolikelihoods to infer Potts models. Physical Review E 87,
012707
7. Baldassi,C., Zamparo,M., Feinauer,C., Procaccini,A., Zecchina,R., Weigt,M., Pagnani,A.
(2014) Fast and accurate multivariate Gaussian modeling of protein families: Predicting
residue contacts and protein-interaction partners. PLoS ONE 9(3): e92721.
8. Wang,Z., Xu,J. (2013) Predicting protein contact map using evolutionary and physical
constraints by integer programming. Bioinformatics 29(13): i266–i273.

Pcons-net

PconsFold2: Optimising the information from contact prediction in ab initio protein folding

D. Menéndez Hurtado, Nanjiang Shu, M. Michel and A. Elofsson


Department of Biochemistry and Biophysics and Science for Life Laboratory; Stockholm University; Box 1031;
S-171 21 Solna; Sweden
arne@bioinfo.se

Modern contact prediction methods yield very few false positives for targets with
high-quality multiple sequence alignments. In particular, we have found that PconsC3 scores
correlate very well with the distance between two residues. Therefore, more aggressive restraints can
be used in the folding stage. Here, we present PconsFold2, a protocol that combines the contact
predictions from PconsC3 [1] with Rosetta's ab initio folding [2]. It is the successor of PconsFold [3].

Methods
PconsFold2 uses the standard Rosetta ab initio folding protocol with additional constraints for
predicted contacts. Only a limited set of 2000 decoys is generated for each target.

We have observed that the raw contact score correlates with the physical distance between
residues, so we can adapt the distance restraints accordingly. Furthermore, the false positives
predicted by PconsC3 usually correspond to residues that are only slightly further apart than the
typically used 8 Å threshold (see Figure 1). In PconsFold2 we therefore use a harmonic potential to
yield a wider attractive funnel in conformational space.
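
A harmonic restraint of this kind can be sketched as follows. The centre `x0`, the width `sd`, the choice of CB atoms and the helper names are illustrative assumptions; the abstract does not state the parameters actually used or how they depend on the contact score. The output line assumes Rosetta's `AtomPair ... HARMONIC x0 sd` constraint-file syntax.

```python
def harmonic_penalty(distance, x0=8.0, sd=2.0):
    """Harmonic restraint ((d - x0) / sd)**2: zero at x0, growing smoothly on both sides."""
    return ((distance - x0) / sd) ** 2


def constraint_line(res_i, res_j, x0=8.0, sd=2.0, atom="CB"):
    """Format one residue-pair restraint in Rosetta-style constraint-file syntax.

    Hypothetical helper: glycine has no CB atom, so a real pipeline would
    need to substitute CA for such residues.
    """
    return f"AtomPair {atom} {res_i} {atom} {res_j} HARMONIC {x0:.1f} {sd:.1f}"
```

Because the penalty keeps growing with distance instead of levelling off, a pair that starts far apart still feels a pull towards `x0`, which is one way to read the "wider attractive funnel". A larger `sd` makes the restraint more tolerant of contacts that are slightly beyond 8 Å.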

Figure 1: Observed residue-residue distance vs PconsC3 score. The black line is a moving average with
a window size of 1000, the horizontal dashed line a contact distance of 8 Å and the dashed vertical
lines indicate N and N/2 cutoffs (left to right).

Another parameter that has to be decided is how many contacts should be included as restraints.
Typically, this is set as a fraction of the length of the protein, depending on the accuracy of the
contact prediction. In PconsFold we found that between L and 2*L contacts, where L is the length of
the protein, was optimal. In contrast, in PconsFold2 we let the contact map dictate
this number, and use a score threshold instead. When the alignments are poor, the scores will be
generally lower, and thus fewer contacts will be reliable. On the other hand, when the contact
maps are accurate, we can expect high scores, hence more contacts can be included in our
folding. In CASP12 we used all contacts with a PconsC3 score of 0.4 or higher.
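
The two selection strategies contrasted above can be sketched side by side. The function names and the `(i, j, score)` tuple layout are assumptions for illustration; only the 0.4 threshold and the L to 2*L range come from the text.

```python
def select_by_score(contacts, threshold=0.4):
    """Score-based selection: keep every predicted contact reaching the threshold."""
    return [(i, j, s) for i, j, s in contacts if s >= threshold]


def select_top_fraction(contacts, length, factor=1.0):
    """Length-based selection: keep the factor * length highest-scoring contacts."""
    ranked = sorted(contacts, key=lambda c: c[2], reverse=True)
    return ranked[:int(factor * length)]
```

With the score-based rule the number of restraints adapts to the quality of the contact map, whereas the length-based rule always returns the same number of contacts for a given protein length.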

Figure 2: PconsFold performance for the 21 targets as measured by TMscore (black line). Also plotted
is the number of effective sequences (Meff), length of the target (L), PPV and average score of
PconsC3 predictions.

Figure 3: Overlay of native (gray) and first ranked PconsC3 prediction (orange) of T0872. TM-score is
0.73.

Results
At the time of writing, PDB contained the structures of 21 CASP12 targets. For three targets
PconsFold2 makes good models (TM-score > 0.6), see Figure 2. In addition, three targets had
TM-scores between 0.4 and 0.5. All three successful targets had good contact maps (PPV > 0.6). In the
case of T0891 this was obtained from an alignment of fewer than 300 effective sequences. The best
prediction made by PconsFold2 was T0872, obtaining a TM-score of 0.73 and an RMSD of 1.52
Å (Figure 3). This was better than any other server prediction. T0872 is also one of the shortest
targets, but as can be seen for T0861 (L > 200), PconsFold2 can also predict the structure
accurately for long targets.

For three targets (T0879, T0864, T0932) good contacts (PPV > 0.6) are obtained but the models are
bad (TM-score < 0.3). All these targets are about 200 residues long. Further investigation is
needed to understand why PconsFold2 did not perform better for these targets. It is possible that
it is mainly a computational limitation, as all predictions were run on a single computer.

Availability
The code will be available on our Github page: https://github.com/ElofssonLab/

1. Skwark,M.J., Michel,M., Menéndez Hurtado,D., Ekeberg,M., Elofsson,A. (2016) Accurate
contact predictions for thousands of protein families using PconsC3. Submitted
2. Leaver-Fay, A., et al. (2011). Rosetta3: An object oriented software suite for the simulation
and design of macromolecules. Methods in Enzymology, 487 (19), 545–574.
3. Mirco Michel, Sikander Hayat, Marcin J. Skwark, Chris Sander, Debora S. Marks, and Arne
Elofsson (2014). PconsFold: improved contact predictions improve protein models.
Bioinformatics 30 (17), i482-i488.

PhyreTopoAlpha

Protein fold recognition using contact threading in the program PhyrePower

I. Filippis, H. Mao, M.J.E. Sternberg and L.A. Kelley


Structural Bioinformatics Group, Imperial College London
i.filippis@imperial.ac.uk

The most difficult modeling tasks involve target proteins with undetectable sequence similarity
to known template structures, despite there often being appropriate templates available. These
cases correspond to the CASP TBM-Hard and FM categories. Without sufficient sequence signal,
other sources of information must be used to model such targets.
Recent advances in protein contact prediction constitute a novel, yet error-prone source of
information. In this work we apply the paradigm of fold recognition/template-based modeling in
the space of contacts as opposed to sequence. Noise in predicted contacts is ameliorated through
the use of eigendecomposition.

Methods
CASP targets are first processed by MetaPSICOV [1] to predict contacting residues, ranked by
predicted reliability. The top L^1.19 predicted contacts are taken to constitute the contact map of the
target, where L is the target length. This contact map is eigendecomposed and the 7 eigenvectors
corresponding to the top 7 eigenvalues are retained. Each residue of the target is thus represented by
7 numbers corresponding to the most informative aspects of the predicted contact map for that
position.
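
The per-residue representation described above can be sketched with NumPy. This is an illustration under stated assumptions, not the PhyrePower implementation: the contact map is taken to be a symmetric 0/1 matrix, "top" is read as algebraically largest, and the eigenvectors are used unscaled. The function name is hypothetical.

```python
import numpy as np


def contact_embedding(contact_map, n_vectors=7):
    """Return an (L, n_vectors) array holding one row of eigenvector components per residue."""
    cmap = np.asarray(contact_map, dtype=float)
    # eigh is intended for symmetric matrices and returns eigenvalues in ascending order.
    eigenvalues, eigenvectors = np.linalg.eigh(cmap)
    # Column indices of the n_vectors largest eigenvalues, largest first.
    top = np.argsort(eigenvalues)[::-1][:n_vectors]
    return eigenvectors[:, top]
```

Each eigenvector is only defined up to sign, so any comparison between two independently decomposed maps has to settle on a sign convention; how that is handled is not described in the abstract.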
A database of known structures taken from SCOP95 and the PDB is converted in a
similar manner, this time using known contact maps. The target is aligned with dynamic
programming to this fold database, where residue-residue similarity is determined by the dot
product of their 7-dimensional representations. Alignments are ranked by the contact map
overlap (CMO) between the target and template contact maps. This measures the number of
shared contacts between target and template. The top ranked alignment is input to Modeller [2] to
generate our Model 1 entry for CASP. Alignments for Models 2-4 are selected by the ProQ2 [3]
quality estimation package and similarly modeled by Modeller.
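
Counting shared contacts for a given alignment can be sketched as below. The data layout (contacts as residue-index pairs, the alignment as a target-to-template index mapping) and the function name are assumptions for illustration.

```python
def contact_map_overlap(target_contacts, template_contacts, alignment):
    """Count target contacts whose aligned template residues are also in contact.

    alignment maps target residue index -> template residue index and
    contains aligned positions only.
    """
    # Store template contacts as unordered pairs so that (a, b) matches (b, a).
    template = {frozenset(pair) for pair in template_contacts}
    shared = 0
    for i, j in target_contacts:
        if i in alignment and j in alignment:
            if frozenset((alignment[i], alignment[j])) in template:
                shared += 1
    return shared
```

Target contacts involving an unaligned residue contribute nothing, so alignments that cover more of the predicted contact map can reach a higher overlap.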
In addition, the predicted contacts are input directly to the Tinker molecular mechanics
package, generating a pool of candidate models. These are ranked using ProQ2, and the lowest
pseudo-energy model is selected as our Model 5 submission.

Results
In our own benchmarks of target-template pairs undetectable by HHsearch [4], PhyrePower is able
to correctly recognize folds in over 40% of cases and create models with TM-score >= 0.5 in over
20% of cases. In addition it demonstrates significant orthogonality to direct modeling approaches
such as Tinker, with less than 50% overlap in modeled targets. This is primarily due to the noise-
tolerance of the new method to errors in contact prediction – normally a serious problem for
molecular mechanics approaches.

Availability
The PhyrePower method will be made available shortly as a Docker image.

1. Jones, D. T., Singh, T., Kosciolek, T. & Tetchner, S. MetaPSICOV: combining coevolution
methods for accurate prediction of contacts and long range hydrogen bonding in proteins.
Bioinformatics 31, 999-1006
2. Marti-Renom, M. A. et al. Comparative protein structure modeling of genes and genomes.
Annu Rev Biophys Biomol Struct 29, 291-325, doi:10.1146/annurev.biophys.29.1.291 (2000).
3. Uziela, K. & Wallner, B. ProQ2: Estimation of Model Accuracy Implemented in Rosetta.
Bioinformatics, doi:10.1093/bioinformatics/btv767 (2016).
4. Soding, J. Protein homology detection by HMM-HMM comparison. Bioinformatics 21, 951-
960, doi:10.1093/bioinformatics/bti125 (2005).

ProQ2

Model Quality Assessment and Selection using ProQ2

B. Wallner
Linköping University, S-581 83 Linköping, Sweden
bjornw@ifm.liu.se

ProQ2 [1] is a single-model quality assessment program that uses support vector machines to
predict local as well as global quality of protein models, and it was recently implemented in
Rosetta [2]. The prediction is based on scalar features calculated from each protein model, using
properties that can be derived from its sequence (e.g. conservation, predicted secondary
structure, and exposure) or 3D coordinates (e.g. atom-atom contacts, residue–residue contacts,
and secondary structure). To achieve a localized prediction, the environment around each residue
is described by calculating the features for a sliding window centered on the residue of interest.
For features involving spatial contacts, residues or atoms outside the window that are in spatial
proximity of those in the window are included as well. After the local prediction is performed,
global prediction is achieved by summing the local predictions and normalizing by the target
sequence length to enable comparisons between proteins. Thus, the global score is a number in
the range [0,1]. The local prediction for residue i is the local S-score [3] with distance threshold
$d_0 = 3$ Å, $S_i = 1/(1 + (d_i/d_0)^2)$. If we are interested in the per-residue distance deviation $d_i$,
we can simply solve the equation for $d_i$: $d_i = 3\sqrt{1/S_i - 1}$, $S_i \in (0,1]$. Since large distance
deviations are equally meaningless, we set all $d_i > 15$ Å to 15 Å.
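
The S-score relations above translate directly into a few lines of Python. This is a minimal sketch; the function names are chosen for illustration, and the handling of non-positive scores (mapped to the 15 Å cap) is an assumption.

```python
import math

D0 = 3.0      # S-score distance threshold in Å, as given in the text
D_MAX = 15.0  # cap applied to large deviations, as given in the text


def s_score(distance, d0=D0):
    """S_i = 1 / (1 + (d_i / d0)^2)."""
    return 1.0 / (1.0 + (distance / d0) ** 2)


def distance_from_s(score, d0=D0, d_max=D_MAX):
    """Invert the S-score: d_i = d0 * sqrt(1 / S_i - 1), capped at d_max."""
    if score <= 0.0:
        return d_max  # S -> 0 corresponds to an unbounded deviation
    return min(d0 * math.sqrt(max(1.0 / score - 1.0, 0.0)), d_max)


def global_score(local_scores, target_length):
    """Sum of local S-scores normalized by the target sequence length."""
    return sum(local_scores) / target_length
```

A local score of 0.5 corresponds to a deviation of exactly 3 Å, and any score below about 0.04 is reported as 15 Å because of the cap.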

Methods
ProQ2 participated in the TS and QA categories in CASP12. The version is identical to the one
that participated in CASP11, and thus can be used as a baseline to compare performance
between CASPs. For the TS category, ProQ2 was used to assess the quality of all submitted
server models, and the top five scoring models, excluding models within 1 Å RMSD of each other,
were submitted after the local predicted deviation had been added to the B-factor column. For
QA, ProQ2 was used on the set of models sent by the CASP organizers.
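
One way to pick the top five models while skipping near-duplicates is a greedy pass over the ranked list. The greedy order, the function name and the `rmsd` callback are assumptions for illustration; the abstract only states that models within 1 Å RMSD were excluded.

```python
def select_diverse_top(models, rmsd, n_select=5, min_rmsd=1.0):
    """Pick the highest-scoring models, skipping any within min_rmsd of one already picked.

    models: list of (name, predicted_score) tuples.
    rmsd: function (name_a, name_b) -> pairwise RMSD in Å.
    """
    selected = []
    # Visit models from best to worst predicted score.
    for name, _ in sorted(models, key=lambda m: m[1], reverse=True):
        if all(rmsd(name, chosen) > min_rmsd for chosen in selected):
            selected.append(name)
        if len(selected) == n_select:
            break
    return selected
```

If fewer than `n_select` sufficiently different models exist, the function simply returns the ones it found.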

Availability
The current version of ProQ2 is available as a scoring function in the Rosetta modeling
suite from http://www.rosettacommons.org. Additional scripts and programs to generate all
necessary input files can be found here: http://github.com/bjornwallner/ProQ_scripts/.

1. Ray, A., Lindahl, E. & Wallner, B. Improved model quality assessment using ProQ2.
BMC Bioinformatics 13, 224 (2012).
2. Uziela, K. & Wallner, B. ProQ2: estimation of model accuracy implemented in Rosetta.
Bioinformatics 32(9), 1411-1413 (2016).
3. Wallner, B. & Elofsson, A. Identification of correct regions in protein models using
structural, alignment, and consensus information. Protein Sci. 15, 900–913 (2006).

ProQ3, ProQ3_1, ProQ3_1diso, ProQ3_2diso, RSA_SS_CONS, Pcons,
Pcons-net

Improved model quality assessments using Rosetta energy terms and deep learning
Karolis Uziela1, Nanjiang Shu1, David Menéndez Hurtado1, Björn Wallner2 and Arne Elofsson1
1 - Science for Life Laboratory and Department of Biochemistry and Biophysics, Stockholm University, Stockholm
10691, Sweden, 2 - Department of Physics, Chemistry and Biology, Linköping University, 581 83 Linköping,
Sweden.
arne@bioinfo.se

Recently, we have developed a new model quality assessment method—ProQ3. It uses the same
machine learning method as ProQ2 (linear SVM), but has some new training features calculated
from Rosetta energy terms. Here, we describe ProQ3 and its deep learning-based variations,
ProQ3_1, ProQ3_1_diso, ProQ3_2_diso, as well as our classical consensus method Pcons and a
reference method RSA_SS_CONS. Finally, we present the Pcons-net QA method, which is a
combination of ProQ3 and Pcons.

Methods
ProQ3 [1] is a combination of ProQ2 [2] and two novel Rosetta-based predictors, ProQRosFA and
ProQRosCen. ProQRosFA uses Rosetta full-atom energy functions, while ProQRosCen uses
Rosetta centroid energy functions. All input features from the three predictors are combined
to train a linear SVM. The training data set is a subset of CASP9 with 30 models per
target.
ProQ3_1 uses the same input features as ProQ3, but it is trained with a deep learning
approach using the Python/Theano framework. The network has two layers of neurons: 600 in the
first layer and 200 in the second. We used an L1/L2 regularizer with a penalty parameter of 10^-5.
The network was trained for 500 epochs. The training data set consists of all models from CASP9,
CASP10 and CASP11.
ProQ3_1_diso is the same as ProQ3_1, but residues that are predicted to be disordered
(by Disopred [3]) are filtered out before calculating the global score of the model. Local scores are
not affected and are the same as in ProQ3_1. We noticed that disordered residues affect
CASP QA evaluation considerably. Disordered residues are often present in the target sequence
but their coordinates are missing in the native structure. Hence, many structure prediction
groups include these disordered residues in their models, but the prediction accuracy of these
residues cannot be evaluated. Usually, in ProQ2/ProQ3 methods we calculate the global model
quality score by summing all local quality scores and dividing by the target length. Here, we
excluded all residues that are predicted to be disordered from the sum when calculating the
global score. We benchmarked this approach on the CASP11 data set and the results for global
quality assessment were slightly better than with the regular approach, so we expect it to be
useful in CASP12, too. The difference should only be visible on targets that contain disordered
residues.
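
The effect of filtering disordered residues on the global score can be sketched as follows. The text only says that disordered residues are excluded from the sum; reducing the normalizing length by the number of disordered residues is an assumption made here, as are the function names and the dictionary layout.

```python
def global_score(local_scores, target_length):
    """Regular global score: sum of local scores divided by the target length."""
    return sum(local_scores.values()) / target_length


def global_score_without_disorder(local_scores, disordered, target_length):
    """Global score over residues not predicted to be disordered.

    local_scores: {residue_index: local score}.
    disordered: set of target residue indices predicted to be disordered.
    """
    ordered_sum = sum(s for idx, s in local_scores.items() if idx not in disordered)
    # Assumption: the normalizer shrinks along with the sum.
    ordered_length = target_length - len(disordered)
    return ordered_sum / ordered_length if ordered_length > 0 else 0.0
```

When `disordered` is empty both functions return the same value, which matches the statement that only targets with disordered residues are affected.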
ProQ3_2_diso is identical to ProQ3_1_diso, but an additional method for RSA/SS
(Relative Surface Area accessibility, Secondary Structure) predictions is used. Here, we used
RSA/SS predictions from SSpro/ACCpro 5 [4] to calculate the agreement between the RSA/SS of
the model and the predicted values. The agreement scores were averaged over windows of 1, 11,
21 and full-sequence lengths. In addition to regular 3-state SS predictions, we made predictions
and calculated agreement scores for 8 SS states. Also, in addition to regular 2-state RSA
predictions (buried/exposed), we made predictions for 20-state RSA, which denotes the level of
surface exposure, and calculated the agreement with the RSA values of the model. The original
RSA/SS agreement features derived from ACCpro 4 [5] and Psipred [6] that were used in
ProQ2/ProQ3 are also included.
RSA_SS_CONS is a predictor that is based on only three types of features: RSA, SS and
Conservation. It uses a linear SVM. The features are derived the same way as in ProQ3_2_diso.
Conservation (sequence entropy) features are the same as in ProQ2. We included this predictor as
a baseline to check what MQA accuracy can be achieved by only using RSA, SS and
Conservation features.
Pcons [7] is our classical consensus-based method. We have not modified it in any way.
Pcons-net is a combination of ProQ3 and Pcons. We calculate its score using the formula 0.2
* ProQ3 + 0.8 * Pcons. This is a similar approach to Pcomb [8], where we used the
formula 0.2 * ProQ2 + 0.8 * Pcons.

Name (alternative)   Training method   Training feature            Disopred   Comment
ProQ3                SVM – linear      ProQ2+Rosetta               No         Single-model MQA
ProQ3_1              Theano [9]        ProQ2+Rosetta               No         Single-model MQA
ProQ3_1_diso         Theano            ProQ2+Rosetta               Yes        Single-model MQA
ProQ3_2_diso         Theano            ProQ2+Rosetta+New SS/RSA    Yes        Single-model MQA
RSA_SS_CONS          SVM               Agreement with RSA + SS     No         Single-model MQA
Pcons                -                 -                           No         Consensus MQA
Pcons-net (Pcomb)    -                 -                           No         0.2 * ProQ3 + 0.8 * Pcons

A short summary of the described methods is presented in Table 1.

Availability
ProQ3 is available at http://proq3.bioinfo.se/.

1. Uziela K, Wallner B, Elofsson A (2016). ProQ3: Improved model quality assessments using
Rosetta energy terms. Nature Scientific Reports. Accepted manuscript.
2. Ray A, Lindahl E, Wallner B (2012) Improved model quality assessment using ProQ2. BMC
Bioinformatics 13, 224
3. Jones DT, Ward JJ (2003) Prediction of disordered regions in proteins from position specific
score matrices. Proteins 53 Suppl 6, 573-578
4. Magnan CN, Baldi P (2014) SSpro/ACCpro 5: almost perfect prediction of protein secondary
structure and relative solvent accessibility using profiles, machine learning and structural
similarity. Bioinformatics 30(18), 2592-2597
5. Cheng J, Randall AZ, Sweredoski MJ, Baldi P (2005) SCRATCH: a protein structure and
structural feature prediction server. Nucleic Acids Res 33(Web Server issue), 72-76
6. Jones DT (1999) Protein secondary structure prediction based on position-specific scoring
matrices. J Mol Biol 292(2), 195-202
7. Wallner B, Elofsson A (2006) Identification of correct regions in protein models using
structural, alignment, and consensus information. Protein Sci 15(4), 900-913

8. Larsson P, Skwark MJ, Wallner B, Elofsson A (2009) Assessment of global and local model
quality in CASP8 using Pcons and ProQ. Proteins 77 Suppl 9, 167-172
9. Al-Rfou et al. (2016) Theano: A Python framework for fast computation of mathematical
expressions. arXiv:1605.02688v1

ProTSAV-Plus

ProTSAV+: A metaserver for protein tertiary structure quality assessment

Ankita Singh1, Rahul Kaushik1,2, B. Jayaram1,2,3


1- Supercomputing Facility for Bioinformatics and Computational Biology, IIT Delhi, 2- Kusuma School of
Biological Sciences, IIT Delhi, 3- Department of Chemistry, IIT Delhi, Hauz Khas, New Delhi (India) – 110016
ankita@scfbio-iitd.res.in, bjayaram@chemistry.iitd.ac.in

Improvements in protein structure prediction methodologies over the years have created urgency
for highly accurate quality assessment methods for predicted protein structures. Better quality
structures help in function assignment and in structure based drug discovery. Here, we describe
the ProTSAV+ metaserver, which integrates 11 individual tools/approaches of quality assessment and
predicts an overall score reflective of the quality of the protein structure. The metaserver
provides the user with a single quality score in the case of an individual model structure and a
ranking in the case of multiple decoy structures. The server overcomes the limitations of any single
server/method and is seen to be robust in helping to improve quality assessment.

Methods
ProTSAV+ is an advanced version of ProTSAV [1], which performs a weighted combination of
several well-known, thoroughly validated tools that are available freely or on request. These tools
are PROCHECK [2], PROSA [3], ERRAT [4], VERIFY-3D [5], DFIRE [6], NACCESS [7], MOLPROBITY [8],
D2N [9], PRO-Q [10], PSN-QA [11] and MM-GBSA [12]. These tools capture, individually or in
combination, various structural and energetic features such as non-covalent interactions,
residue-based contact potentials, burial preferences of residues, accessible surface area, residue
packing preferences, globularity, secondary structural information, backbone dihedral distribution
preferences, side chain packing and energy-based scoring.
ProTSAV+ can perform both the quality assessment of an individual structure and the relative
ranking of multiple structures of the same protein. For a given input structure(s), it calculates
an overall global score (P-score) in the range of 0 to 1, where scores closer to 1 reflect better
quality and scores closer to 0 reflect poor structures.

Results
The ProTSAV+ metaserver was fielded in the recently concluded CASP12 experiment for quality
assessment and ranking of the model structures submitted by participating structure prediction
servers. The method has performed reasonably well so far on the targets whose native
structures have been released in the PDB. For instance, ProTSAV+ evaluated and ranked the
model structures for 20 such targets with a correlation
coefficient of 0.81 with calculated GDT scores and -0.76 with calculated RMSDs.

Availability
ProTSAV+ metaserver can be freely accessed from http://scfbio-iitd.res.in/ProTSAV+

1. Singh,A., Kaushik,R., Mishra,A., Shanker,A. & Jayaram,B. (2016). ProTSAV: A Protein
Tertiary Structure Analysis and Validation Server. BBA-Proteins and Proteomics 1864, 11-19.

2. Laskowski,R.A. (1993). PROCHECK: a program to check the stereo chemical quality of
protein structures. J. Appl. Crystallogr. 26, 283-291.
3. Wiederstein,M., Sippl,M.J. (2007). ProSA-web: interactive web service for the recognition of
errors in three-dimensional structures of proteins. Nucleic Acids Res. 35, W407-W410.
4. Colovos,C., Yeates,T.O. (1993). Verification of protein structures, patterns of non-bonded
atomic interactions. Protein Sci. 2, 1511-1519.
5. Luthy,R., Bowie,J.U., Eisenberg,D. (1992). Assessment of protein models with three
dimensional profiles. Nature 356, 83–85.
6. Yang,Y., Zhou,Y. (2008). Specific interactions for ab initio folding of protein terminal
regions with secondary structures. Proteins 72,793-803.
7. Lee,B., Richards,F.M. (1971). The interpretation of protein structures: estimation of static
accessibility. J. Mol. Biol. 55, 379-400.
8. Davis,I.W., Fay,L.A., Chen,B.V., Chen,B., Block,N.J., et al. (2007). MolProbity: all-atom
contacts and structure validation for proteins and nucleic acids. Nucleic Acids Res. 35, W375-
W383.
9. Mishra, A., Rana,P.S., Mittal, A. & Jayaram,B. (2014). D2N: distance to the native, BBA-
Proteins and Proteomics 10, 1798-1807.
10. Wallner,B., Elofsson,A. (2003). Can correct protein models be identified? Protein Sci.
12, 1073-1086.
11. Ghosh,S., Vishveshwara,S. (2014). Ranking the quality of protein structure models
using side chain based network properties. F1000 Res. 3, 17.
12. Jayaram,B., Sprous,D., Beveridge,D.L. (1998). Solvation Free Energy of
Biomacromolecules: Parameters for a Modified Generalized Born Model Consistent
with the AMBER Force Field. J. Phys. Chem. B. 102, 9571-9576.

QASproCL

Quality ASsessment of protein structure based on CLustering

Sandeep Singh, Gandharva Nagpal, Deepti Sethi, Chinmayee Choudhury, Piyush Agarwal
and Gajendra P.S. Raghava*
Bioinformatics Centre, CSIR-Institute of Microbial Technology, Chandigarh (160036), India
*raghava@imtech.res.in

The QASproCL method is a hybrid approach combining QASproGP (a single-model method) and a
clustering-based approach.

Methods
The QASproCL method is trained on a large dataset of structural models obtained from the CASP8,
CASP9 and CASP10 experiments. A cutoff of GDT_TS ≥ 0.3 was applied to select the
structural models used in the training set. The test set comprised 79 targets obtained from the
recent CASP11 experiment (stage1 tarballs) without applying any cutoff criteria. We also
checked the performance of our method on the 21 targets of the present CASP12 experiment whose
PDB structures had been released by 14th September 2016. We calculated 3 different types of
scores for each structural model of a given target: (a) the average GDT_TS score of the structural
model against all the other structural models of the target; (b) the average Pearson Correlation
Coefficient (PCC) between the computed accessible surface area (using DSSP software [1]) of the
structural model and that of all the other structural models; and (c) the prediction score given by
the QASproGP method. These 3 scores were then used to fit a linear regression model that
predicts the GDT_TS score of the structural model. The equation of the linear regression model
was as follows: (1.1167*a) + (0.0407*b) + (0.1891*c), where a, b and c represent the average
GDT_TS score, average PCC and QASproGP score, respectively.
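
The regression above can be written out as a short sketch. The coefficients are taken from the text; the function names and the pairwise-matrix layout used for the averaging step are assumptions for illustration.

```python
def qasprocl_score(avg_gdt_ts, avg_pcc, qasprogp_score):
    """Predicted GDT_TS = 1.1167*a + 0.0407*b + 0.1891*c, coefficients from the text."""
    return 1.1167 * avg_gdt_ts + 0.0407 * avg_pcc + 0.1891 * qasprogp_score


def average_against_others(index, pairwise):
    """Mean of a pairwise measure between model `index` and every other model.

    pairwise: square matrix (list of lists) of, e.g., model-to-model GDT_TS values.
    """
    others = [pairwise[index][k] for k in range(len(pairwise)) if k != index]
    return sum(others) / len(others)
```

The large coefficient on the average pairwise GDT_TS shows that the clustering term dominates the combined score, with the single-model QASproGP score acting as a smaller correction.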

Results
To evaluate the performance of QASproCL method, we calculated the PCC between the actual
and the predicted GDT_TS score of the structural models of the test set. We obtained PCC of
~0.954 on the CASP11 test set. On 21 targets of the CASP12 experiments released as stage1 and
stage2 tarballs, we achieved PCC of 0.562 and 0.687 respectively.

Availability
QASproCL method is implemented in the form of a web-based service which can be accessed at
http://crdd.osdd.net/raghava/qaspro/.

1. Kabsch, W. & Sander, C. Dictionary of protein secondary structure: pattern recognition of
hydrogen-bonded and geometrical features. Biopolymers 22, 2577-2637,
doi:10.1002/bip.360221211 (1983).

QASproGP

Quality ASsessment of protein structure based on Geometrical Parameters

Gandharva Nagpal, Sandeep Singh, Deepti Sethi, Chinmayee Choudhury, Piyush Agarwal
and Gajendra P.S. Raghava*
Bioinformatics Centre, CSIR-Institute of Microbial Technology, Chandigarh (160036), India
*raghava@imtech.res.in

A number of methods are available to predict the tertiary structure of proteins. These methods
generate a selected number of tertiary structure models among which lie the near-native
structures. The current challenge faced by these prediction methods is to select the best model
from a number of structural models. In the past, methods have been developed to rank the
predicted structural models based on the quality of the models [1]. In this study we have developed
a method, QASproGP, to select and rank the predicted structural models based on different
geometrical parameters of the models. QASproGP is a single-model method and is trained on a
large dataset of structural models (~16000 models) obtained from CASP8, CASP9 and CASP10
experiments.

Methods
We obtained the predicted structural models (designated as Model 1) from the CASP8, CASP9 and
CASP10 experiments with GDT_TS score ≥ 0.3 and used these as our training dataset. For the test
set, we used the predicted models of 79 targets extracted from the CASP11 experiment. We also tested
the performance of our method on 21 targets of CASP12 whose PDB structures had been
released by 14th September 2016. We generated different geometrical parameters from the
predicted structural models, as follows: (a) amino acid-wise accuracy between the
computed and predicted secondary structure states (from SCRATCH software [2]) of the model
(vector size 20); (b) amino acid-wise accuracy between the computed and predicted accessible
surface area (ASA) of the model (vector size 20); (c) evolutionary information with respect to
the buried, exposed and intermediate residues of the model (vector size 60); (d) amino acid-wise
ASA of the model (vector size 60); (e) different geometrical parameters as calculated by VADAR
software [3] (vector size 60); (f) hydrogen bond types and their percent in the model as calculated
by DSSP software [4] (vector size 12) and (g) atom-atom contacts between 13 atom types as
defined in the previous study [5] (vector size 91). In this way, a total of 323 features are used in
QASproGP to predict the quality score of the structural model using support vector machines.
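
The feature counts listed above can be checked with a small bookkeeping sketch. The block names are hypothetical labels, and reading the 91 contact features as unordered pairs of 13 atom types, including same-type pairs, is an interpretation that happens to match the stated vector size.

```python
# Vector sizes as listed in the text, keyed by hypothetical block names.
FEATURE_BLOCKS = {
    "secondary_structure_agreement": 20,  # (a)
    "asa_agreement": 20,                  # (b)
    "evolutionary_information": 60,       # (c)
    "residue_asa": 60,                    # (d)
    "vadar_geometry": 60,                 # (e)
    "hydrogen_bond_types": 12,            # (f)
    "atom_atom_contacts": 91,             # (g)
}


def unordered_pairs_with_self(n_types):
    """Number of unordered type pairs when a type may pair with itself."""
    return n_types * (n_types + 1) // 2


total_features = sum(FEATURE_BLOCKS.values())
```

Here `total_features` evaluates to 323 and `unordered_pairs_with_self(13)` to 91, in agreement with the numbers given in the text.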

Results
We used Pearson Correlation Coefficient (PCC) between actual and predicted GDT_TS score of
the models as a measure to evaluate the performance of our method. PCC was calculated when
all the models were pooled together. We obtained PCC of ~0.82 on 79 targets of CASP11
released as stage1 tarballs. On 21 targets of the present CASP12 experiment, our method
achieved PCC of 0.565 and 0.633 on stage1 and stage2 released tarballs respectively.

Availability
QASproGP method is implemented in the form of a web-based service which can be accessed at
http://crdd.osdd.net/raghava/qaspro/.

1. Kryshtafovych, A. et al. Methods of model accuracy estimation can help selecting the best
models from decoy sets: Assessment of model accuracy estimations in CASP11. Proteins,
doi:10.1002/prot.24919 (2015).
2. Cheng, J., Randall, A. Z., Sweredoski, M. J. & Baldi, P. SCRATCH: a protein structure and
structural feature prediction server. Nucleic Acids Res 33, W72-76,
doi:10.1093/nar/gki396 (2005).
3. Willard, L. et al. VADAR: a web server for quantitative evaluation of protein structure
quality. Nucleic Acids Res 31, 3316-3319 (2003).
4. Kabsch, W. & Sander, C. Dictionary of protein secondary structure: pattern recognition of
hydrogen-bonded and geometrical features. Biopolymers 22, 2577-2637,
doi:10.1002/bip.360221211 (1983).
5. Ray, A., Lindahl, E. & Wallner, B. Improved model quality assessment using ProQ2. BMC
Bioinformatics 13, 224, doi:10.1186/1471-2105-13-224 (2012).

QMEANDisCo

QMEANDisCo – Combining Statistical Potentials with Consensus-based Prediction of Local Model
Quality

C. Rempfer1,2, G. Studer1,2, J. Haas1,2 and T. Schwede1,2


1 - Biozentrum – University of Basel, 2 - SIB – Swiss Institute of Bioinformatics
gabriel.studer@unibas.ch

Computational protein structure modelling methods, and in particular comparative modelling,
have established themselves as a valuable complement for structural analysis when experimental
data is missing. While such methods have matured into stable and robust pipelines that can
generate models for almost any protein automatically, the quality of the generated models can be
highly variable and hard to predict in the absence of experimental observables. This is a major
concern from an application perspective as the suitability of a model for a specific application
directly depends on its quality, hence the importance of quality estimation methods. Currently,
the most accurate QE methods rely on consensus information by assessing the variability in an
ensemble of models, assuming that correct structural features will tend to be more conserved.
However, this approach is only applicable if several alternative models are available. In contrast,
knowledge-based statistical methods can be applied to single models by comparing structural
features in the model with those obtained from statistical distributions derived from high-quality
experimental structures. An example of the latter approach is QMEAN1, developed in our group.

Methods

QMEAN uses statistical potentials of mean force and the consistency of a model with structural
features predicted from sequence to generate quality estimates on a global and local scale. A
specialized version called QMEANBrane2 employs statistical potentials specifically trained to
assess the local quality of membrane protein models. Here we extend QMEAN by a new term
called DisCo.
DisCo evaluates the consistency of the model with distance-based constraints extracted from
experimentally determined structures of homologous proteins. When many close homologous
structures exist, DisCo can be expected to be very accurate. However, if few or no close
homologous templates can be identified, DisCo does not contain sufficient information for
scoring models. To combine the ability of statistical potentials to score individual models
with the power of DisCo in cases with sufficient template information, we use a machine
learning approach to optimally weight the two components and derive a combined score for
accurate local quality estimates: QMEANDisCo. The score was trained using global or local
lDDT3 as the target value for optimization. We avoid target values based on
reduced structural representations, as these measures do not reflect the variety of local atomic
interactions in sufficient detail.
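
The idea of scoring a model against template-derived distance constraints can be illustrated with a minimal sketch. It is not the QMEANDisCo implementation: the input format, the fixed tolerance, the function names and the linear blend (which stands in for the machine-learned weighting described above) are all assumptions made for illustration.

```python
import math

def disco_like_scores(model_coords, constraints, tolerance=1.0):
    """Per-residue fraction of template-derived distance constraints satisfied.

    model_coords: list of (x, y, z) tuples, one per residue.
    constraints:  dict mapping residue index pairs (i, j) to the distance in
                  Angstrom expected from homologous templates (hypothetical format).
    Residues without any constraint get None (no template information).
    """
    n = len(model_coords)
    satisfied = [0] * n
    total = [0] * n
    for (i, j), expected in constraints.items():
        observed = math.dist(model_coords[i], model_coords[j])
        ok = abs(observed - expected) <= tolerance
        for r in (i, j):
            total[r] += 1
            satisfied[r] += int(ok)
    return [satisfied[r] / total[r] if total[r] else None for r in range(n)]

def combine_scores(potential_score, disco_score, disco_weight):
    """Blend both components; use the potential alone when no templates exist.

    The linear blend is a placeholder for the trained weighting.
    """
    if disco_score is None:
        return potential_score
    return disco_weight * disco_score + (1.0 - disco_weight) * potential_score
```

A residue with no constraint keeps its statistical-potential score, which mirrors the single-model fallback described in the text.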

Results

QMEANDisCo combines the convenience of taking a single model as input with the predictive
power of consensus approaches over a wide range of model accuracy by combining statistical
potentials with distance constraints from homologous templates. We show that QMEANDisCo
significantly increases the reliability of local model quality estimates compared to QMEAN on
a wide variety of test sets. QMEAN and QMEANDisCo have been developed for detecting local
errors in models with a correct overall fold. An extensive analysis, including a comparison to other
state-of-the-art methods, shows that QMEAN and QMEANDisCo perform well for this
application and outperform other quality estimation methods in this respect.

1. Benkert P., Biasini M., Schwede T. (2011). Toward the estimation of the absolute quality of
individual protein structure models. Bioinformatics. 27(3), 343-50.
2. Studer G., Biasini M., Schwede T. (2014). Assessing the local structural quality of
transmembrane protein models using statistical potentials (QMEANBrane). Bioinformatics.
30(17), i505-11.
3. Mariani V., Biasini M., Barbato A., Schwede T. (2013). lDDT: a local superposition-free
score for comparing protein structures and models using distance difference tests.
Bioinformatics. 29(21), 2722-8.

QUARK

NNB-QUARK: Ab Initio Protein Structure Assembly Guided by Sequence-based Contact Predictions

Baoji He1,2, Chengxin Zhang1, Yanting Wang2, Yang Zhang1


1-Department of Computational Medicine and Bioinformatics, Department of Biological Chemistry, University of
Michigan, 100 Washtenaw Ave, Ann Arbor, MI 48109, 2-Institute of Theoretical Physics, The Chinese Academy of
Sciences, Beijing 100080;
yangzhanglab@umich.edu

QUARK has been developed for ab initio protein structure prediction by the assembly of
continuously sized fragments.1, 2 Recently, we developed NNB-QUARK which uses the
sequence-based contact prediction from NN-BAYES3 to guide the structure assembly
simulations. NNB-QUARK was tested in CASP12 as the QUARK server.

Methods
Starting from the query sequence, a set of structural fragments of 1-20 residues is collected
from the structures of unrelated proteins in the PDB. Full-length structure models are then
assembled from the fragments by replica-exchange Monte Carlo (REMC) simulations. The
original knowledge-based QUARK force field contains a variety of local structure features
derived from sequence (e.g. beta-turns, backbone torsion angles, solvent accessibility, and helix
and strand packing possibilities). In particular, a set of long-range residue-residue contacts
derived from the fragment-based distance profiles is used to assist the fragment assembly
simulation.2 The final models are selected based on SPICKER clustering4 of the simulation
decoys and are further refined by the ModRefiner5 and FG-MD6 programs.
The major difference between QUARK and NNB-QUARK is that the sequence-based
contact predictions generated by NN-BAYES have been incorporated into the NNB-QUARK force
field to guide the simulations. The energy function of the contact restraints in NNB-QUARK is:

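$$
E_{rest}(i,j) =
\begin{cases}
-U'_{ij}, & d_{ij} < 8 \\
-\frac{1}{2} U'_{ij} \left[ 1 - \sin\left( \frac{d_{ij} - 9}{2} \pi \right) \right], & 8 \le d_{ij} \le 10 \\
\frac{1}{2} U^{*}_{ij} \left[ 1 + \sin\left( \frac{d_{ij} - 45}{70} \pi \right) \right], & 10 \le d_{ij} \le 80 \\
U^{*}_{ij}, & d_{ij} > 80
\end{cases}
\tag{1}
$$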

where $d_{ij}$ is the distance between the i-th and j-th residues in Å. The height of the energy
barrier is determined by the predicted accuracy ($acc_{ij}$) of the NN-BAYES contact prediction:

$$
\begin{cases}
U'_{ij} = k_B T \ln\left( acc_{ij} / 0.22 \right) \\
U^{*}_{ij} = k_B T \ln\left( 0.7 / acc_{ij} \right)
\end{cases}
\tag{2}
$$

where $acc_{ij}$ is derived from the confidence score of NN-BAYES, linearly
renormalized from [0.3, 1] to [0.22, 0.7]. Physically, Eq. (2) is equivalent to the assumption that
the confidence score (or the predicted accuracy) is inversely proportional to the probability that the
constrained residue pair escapes the potential well. Unlike many square-well-like energy
functions, which have sharp turns at the transition or boundary points, the potential in Eq. (1) has
been designed so that its gradient, $\partial E/\partial d$, equals 0 at each transition point at
8, 10 and 80 Å; this smooths the movement of the simulation across the different ranges of
restraint distance.
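
The restraint potential of Eqs. (1) and (2) can be written out as a short sketch, which also makes the continuity at 8, 10 and 80 Å easy to check numerically. The function names are illustrative and the energy unit k_B T = 1 is an assumption; neither comes from the NNB-QUARK code.

```python
import math

KT = 1.0  # assumed energy unit: energies are expressed in multiples of k_B*T

def rescale_confidence(score):
    """Linearly map an NN-BAYES confidence score from [0.3, 1] to [0.22, 0.7]."""
    return 0.22 + (score - 0.3) * (0.7 - 0.22) / (1.0 - 0.3)

def well_depth(acc):
    """U' in Eq. (2): depth of the well for distances below 8 Angstrom."""
    return KT * math.log(acc / 0.22)

def barrier_height(acc):
    """U* in Eq. (2): plateau energy for distances above 80 Angstrom."""
    return KT * math.log(0.7 / acc)

def restraint_energy(d, acc):
    """Contact restraint energy of Eq. (1) for distance d (Angstrom)."""
    u_well = well_depth(acc)
    u_far = barrier_height(acc)
    if d < 8.0:
        return -u_well
    if d <= 10.0:
        return -0.5 * u_well * (1.0 - math.sin((d - 9.0) / 2.0 * math.pi))
    if d <= 80.0:
        return 0.5 * u_far * (1.0 + math.sin((d - 45.0) / 70.0 * math.pi))
    return u_far
```

Evaluating `restraint_energy` just below and just above each transition point gives matching values, consistent with the zero-gradient design described above.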
Only contacts with a confidence score above $score_{cut}$ are considered, where $score_{cut}$ =
0.5 for short-, 0.4 for medium-, and 0.3 for long-range contacts. In addition, the total number of
contacts used is kept below the predicted number of contacts, which is estimated by a separate
neural network predictor trained on the length and secondary structure composition of
the query sequence.
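
The contact filtering step can be sketched as follows. The sequence-separation boundaries for short-, medium- and long-range contacts (6-11, 12-23 and 24 or more residues) follow the common CASP convention and are an assumption here, as is treating the predicted number of contacts as a simple upper bound; the abstract does not state either.

```python
# Confidence cutoffs per contact range, as given in the text.
SCORE_CUT = {"short": 0.5, "medium": 0.4, "long": 0.3}

def contact_range(i, j):
    """Classify a residue pair by sequence separation (assumed CASP convention)."""
    separation = abs(i - j)
    if separation >= 24:
        return "long"
    if separation >= 12:
        return "medium"
    if separation >= 6:
        return "short"
    return None  # too close in sequence to count as a contact

def select_contacts(predictions, max_contacts):
    """Keep contacts above the range-specific cutoff, capped at max_contacts.

    predictions:  iterable of (i, j, confidence) tuples.
    max_contacts: predicted number of contacts (hypothetical input, e.g. from
                  a separate predictor).
    """
    kept = []
    for i, j, score in predictions:
        rng = contact_range(i, j)
        if rng is not None and score > SCORE_CUT[rng]:
            kept.append((i, j, score))
    kept.sort(key=lambda contact: contact[2], reverse=True)
    return kept[:max_contacts]
```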

Availability
The on-line NNB-QUARK server to model proteins with length below 200 residues is available
at http://zhanglab.ccmb.med.umich.edu/NNB-QUARK.

1. Xu, D.; Zhang, Y., Ab initio protein structure assembly using continuous structure fragments
and optimized knowledge-based force field. Proteins 2012, 80, 1715-35.
2. Xu, D.; Zhang, Y., Toward optimal fragment generations for ab initio protein structure
assembly. Proteins 2013, 81, 229-39.
3. He, B.; Mortuza, G.; Wang, Y.; Zhang, Y., NN-BAYES: Predicting protein contact maps using
neural network training coupled with naive Bayes classifiers. 2016, submitted.
4. Zhang, Y.; Skolnick, J., SPICKER: A clustering approach to identify near-native protein
folds. J Comput Chem 2004, 25, 865-71.
5. Xu, D.; Zhang, Y., Improving the physical realism and structural accuracy of protein models
by a two-step atomic-level energy minimization. Biophys J 2011, 101, 2525-34.
6. Zhang, J.; Liang, Y.; Zhang, Y., Atomic-level protein structure refinement using fragment-
guided molecular dynamics conformation sampling. Structure 2011, 19, 1784-95.

RaptorX-Contact

Accurate De Novo Prediction of Protein Contact Map by Ultra-Deep Learning

Jinbo Xu, Sheng Wang, Siqi Sun, Zhen Li and Renyu Zhang
Toyota Technological Institute at Chicago
jinboxu@gmail.com
Protein contact prediction has made good progress due to the development of direct evolutionary
coupling (EC) analysis. However, direct EC analysis is effective only on proteins with a very large
number of sequence homologs. To deal with proteins with fewer sequence homologs, we borrow
ideas from deep learning, a powerful machine learning technique that has recently revolutionized
object and speech recognition and the game of Go. Our deep learning method integrates residue
co-variation and conservation information through two deep residual neural networks and learns
complex sequence-contact relationships and long-range contact correlations from the protein
sequence and structure databases. Our experimental results suggest that deep learning can
significantly improve contact prediction and contact-based folding. Details are available at
http://arxiv.org/abs/1609.00680 or http://biorxiv.org/content/early/2016/09/06/073239.
Methods
Fig. 1 illustrates our deep neural network for contact prediction. Unlike previous
supervised learning approaches, which employ only a small number of hidden layers
(i.e., a shallow architecture), our deep network employs dozens of hidden layers. By
using a very deep architecture, our model can automatically learn complex sequence-contact
relationships and inter-contact correlations and thus improve contact prediction. Our model
consists of two major modules, each being a residual neural network. The first module conducts
a series of 1-dimensional (1D) convolutional transformations of sequential features
(sequence profile, predicted secondary structure and solvent accessibility). The output of the 1st
module is converted to a 2-dimensional (2D) matrix by an operation similar to an outer product and
fed into the 2nd module together with pairwise features (i.e., mutual information (MI), direct
co-evolution and pairwise contact potential). The 2nd module is a 2D residual network that conducts
a series of 2D convolutional transformations of its input. Finally, the 2nd module feeds its output
to a logistic regression, which predicts the probability that any two residues form a contact. In
addition, each convolutional layer is preceded by a simple nonlinear transformation called a
rectified linear unit. The output of each 1D convolutional layer has dimension L×m, where L is the
protein sequence length and m the number of hidden neurons at one residue. The output of a 2D
convolutional layer has dimension L×L×n, where n is the number of hidden neurons for one residue
pair. The number of hidden neurons may vary from layer to layer.
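
The change of shape between the two modules can be made concrete with a small sketch. It only tracks how an L×m output becomes an L×L input for the second module and how a logistic regression turns one pair vector into a probability; the residual networks themselves are not implemented. Pairwise concatenation is used as a stand-in for the "operation similar to an outer product", and all names and toy values are assumptions.

```python
import math

def outer_concat(seq_feats):
    """Convert L x m per-residue features into an L x L x 2m pair tensor.

    Entry (i, j) is the concatenation of the feature vectors of residues i and j.
    """
    return [[fi + fj for fj in seq_feats] for fi in seq_feats]

def append_pairwise(pair_tensor, pairwise_feats):
    """Append L x L x p pairwise features to the L x L x 2m pair tensor."""
    size = len(pair_tensor)
    return [[pair_tensor[i][j] + pairwise_feats[i][j] for j in range(size)]
            for i in range(size)]

def contact_probability(pair_vec, weights, bias):
    """Logistic regression on the feature vector of one residue pair."""
    z = sum(w * x for w, x in zip(weights, pair_vec)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# Toy example: L = 4 residues, m = 3 sequential features, p = 2 pairwise features.
L, m, p = 4, 3, 2
seq_feats = [[0.1 * (i + k) for k in range(m)] for i in range(L)]
pairwise = [[[0.5, -0.5] for _ in range(L)] for _ in range(L)]
pair_input = append_pairwise(outer_concat(seq_feats), pairwise)
weights = [0.2] * (2 * m + p)  # arbitrary illustrative weights
probs = [[contact_probability(pair_input[i][j], weights, 0.0) for j in range(L)]
         for i in range(L)]
```

Here `pair_input` has shape L×L×(2m+p), which is what a 2D convolutional module would take as input.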

Results
1. Tested on 579 proteins, our method greatly outperforms the others (Tables 1-3).
2. Using our predicted contacts as restraints for ab initio folding, we can build 3D models much
better than homology models (especially for membrane proteins).
3. The deep learning models trained without any membrane proteins perform almost as well
on membrane proteins as those trained with both membrane and non-membrane proteins, and much
better than those trained on membrane proteins only.
4. We were still developing this method during CASP12 and did not finish it until the end of
CASP12; thus, we applied it only to the last several targets. In the past 3 weeks we blindly tested
our fully automated contact prediction server (server ID: server60) through CAMEO. Our web
server successfully folded three hard CAMEO targets with a new fold: one beta protein of 182
residues (CAMEO ID: 2016-09-10_00000002_1, PDB ID: 2nc8A), one alpha protein of 140 residues
(CAMEO ID: 2016-09-24_00000052_1, PDB ID: 5djeB) and one alpha+beta protein of 125 residues
(CAMEO ID: 2016-09-17_00000018_1, PDB ID: 5dcjA). They have 180-330 effective sequence
homologs. Our contact-based models for 5djeB and 2nc8A have TM-scores of 0.65 and 0.61,
respectively, while the best models submitted by the other CAMEO servers have TM-scores of 0.35
and 0.47, respectively.
Table 1. Contact prediction accuracy on 105 CASP11 test proteins.

             Short-range               Medium-range              Long-range
Method       L/10  L/5   L/2   L       L/10  L/5   L/2   L       L/10  L/5   L/2   L
EVfold       0.25  0.21  0.15  0.12    0.33  0.27  0.19  0.13    0.37  0.33  0.25  0.19
PSICOV       0.29  0.23  0.15  0.12    0.34  0.27  0.18  0.13    0.38  0.33  0.25  0.19
CCMpred      0.35  0.28  0.17  0.12    0.40  0.32  0.21  0.14    0.43  0.39  0.31  0.23
MetaPSICOV   0.69  0.58  0.39  0.25    0.69  0.59  0.42  0.28    0.60  0.54  0.45  0.35
Our method   0.82  0.70  0.46  0.28    0.85  0.76  0.55  0.35    0.81  0.77  0.68  0.55

Table 2. Contact prediction accuracy on 76 CAMEO test proteins.

             Short-range               Medium-range              Long-range
Method       L/10  L/5   L/2   L       L/10  L/5   L/2   L       L/10  L/5   L/2   L
EVfold       0.17  0.13  0.11  0.09    0.23  0.19  0.13  0.10    0.25  0.22  0.17  0.13
PSICOV       0.20  0.15  0.11  0.08    0.24  0.19  0.13  0.09    0.25  0.23  0.18  0.13
CCMpred      0.22  0.16  0.11  0.09    0.27  0.22  0.14  0.10    0.30  0.26  0.20  0.15
MetaPSICOV   0.56  0.47  0.31  0.20    0.53  0.45  0.32  0.22    0.47  0.42  0.33  0.25
Our method   0.67  0.57  0.37  0.23    0.69  0.61  0.42  0.28    0.69  0.65  0.55  0.42

Table 3. Contact prediction accuracy on 398 membrane proteins.

             Short-range               Medium-range              Long-range
Method       L/10  L/5   L/2   L       L/10  L/5   L/2   L       L/10  L/5   L/2   L
EVfold       0.16  0.13  0.09  0.07    0.28  0.22  0.13  0.09    0.44  0.37  0.26  0.18
PSICOV       0.22  0.16  0.10  0.07    0.29  0.21  0.13  0.09    0.42  0.34  0.23  0.16
CCMpred      0.27  0.19  0.11  0.08    0.36  0.26  0.15  0.10    0.52  0.45  0.31  0.21
MetaPSICOV   0.45  0.35  0.22  0.14    0.49  0.40  0.27  0.18    0.61  0.55  0.42  0.30
Our method   0.60  0.46  0.27  0.16    0.66  0.53  0.33  0.22    0.78  0.73  0.62  0.47
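
The columns of Tables 1-3 report the accuracy of the top L/10, L/5, L/2 and L predicted contacts per range, where L is the sequence length. A minimal sketch of this kind of metric is given below. The sequence-separation ranges used in the usage note (6-11, 12-23 and 24 or more residues) follow the common CASP convention and are an assumption, as is dividing by the number of predictions actually evaluated; the abstract does not spell out either detail.

```python
def top_k_precision(predictions, native_contacts, seq_len, divisor,
                    min_sep, max_sep=None):
    """Fraction of the top L/divisor predictions in a range that are native.

    predictions:     iterable of (i, j, probability) tuples.
    native_contacts: set of (i, j) pairs with i < j observed in the structure.
    min_sep/max_sep: inclusive bounds on the sequence separation |i - j|.
    """
    in_range = [(i, j, prob) for i, j, prob in predictions
                if abs(i - j) >= min_sep
                and (max_sep is None or abs(i - j) <= max_sep)]
    in_range.sort(key=lambda item: item[2], reverse=True)
    k = max(1, seq_len // divisor)
    top = in_range[:k]
    if not top:
        return 0.0
    hits = sum(1 for i, j, _ in top if (min(i, j), max(i, j)) in native_contacts)
    return hits / len(top)
```

For example, long-range top-L/10 accuracy would be `top_k_precision(preds, native, L, 10, min_sep=24)` under the assumed range definition.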

Availability: See our web server at http://raptorx.uchicago.edu/ContactMap/.

RBO_Aleph

Protein Structure Prediction by RBO Aleph in CASP12

Mahmoud Mabrouk1, Kolja Stahl1, Michael Schneider2 and Oliver Brock1


1 – Robotics and Biology Laboratory, Technische Universität Berlin, 2 - Chair of Bioanalytics, Institute of
Biotechnology, Technische Universität Berlin
oliver.brock@tu-berlin.de

RBO Aleph is a protein structure prediction server with a focus on free modeling targets. Our
approach is based on two key ideas: 1) leveraging diverse information sources to gain
knowledge, in the form of contacts, about the native conformation 2) incorporating this
knowledge into the energy landscape and using Model-based Search (MBS) to steer search
towards low-energy regions. MBS builds a model of the energy landscape to focus sampling into
low-energy regions. This approach enables us to efficiently exploit predicted residue-residue
contacts in search.
This server is an update of a previous version that participated in CASP11. The new version
includes a novel contact prediction method (server RBO-Epsilon). RBO-Epsilon leverages
diverse and complementary information sources (evolutionary, sequence-based and
physicochemical) to predict contacts using a deep learning approach. The update also addresses a
major shortcoming of our pipeline: the lack of diversity in our decoy sets. We tackle this issue by
using an explorative algorithm, Rosetta, in addition to MBS for conformational search.

Methods
Pipeline overview. RBO Aleph1 retrieves templates using a combination of scores from
HHsearch2, LOMETS3, SparksX4 and RaptorX5. If templates are found, they are used to
split the protein into domains; if not, the domain boundaries are predicted using two sequence-based
domain prediction algorithms, PPRODO6 and DomPro7. Domains with available templates
are modeled using Modeller, and the top models are selected using Prosa8, a knowledge-based
energy function. Free modeling domains are modeled using the method described below.
Contact prediction. We use a new contact prediction method developed for CASP12 (server
RBO-Epsilon). Our method extends current approaches by combining evolutionary
(GaussDCA9, CCMpred10, EVfold11, GREMLIN12, PSICOV13), sequence-based and
physicochemical information (EPC-map14). Deep learning is used to effectively exploit the
different profiles of the information sources. A more detailed description of the method can be
found in the RBO-Epsilon abstract in this issue.
Ab initio prediction. As in CASP11, we use Model-based Search (MBS15) to leverage the
predicted contacts in conformational search. MBS identifies funnels in the energy landscape and
incrementally increases the sampling in the regions containing low-energy conformations. The
predicted contacts are incorporated as distance constraints and added to the energy function to
bias the search. Previous analysis16 showed that MBS performs poorly compared to other
methods when fed wrong contacts. This stems from its tendency to over-exploit the input information
and thereby focus sampling on non-native regions of the conformational space. In CASP12, we
attempt to overcome this problem by using Rosetta17 Monte Carlo-based search, in addition to MBS,
to sample the conformational space. Compared to MBS, Rosetta produces a more diverse decoy set.

Combining both methods allows us to produce a decoy set that is diverse yet contains decoys
which satisfy the majority of contacts. We finally use the Prosa scoring function to select the top
models from the decoys generated from MBS and Rosetta.
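
The two ideas in this paragraph, biasing the energy with predicted contacts and selecting from a pooled decoy set, can be sketched as below. The abstract does not give the functional form of the distance constraints, so the linear penalty, the 8 Å cutoff, the weight and every function name are assumptions; `score_fn` is a hypothetical stand-in for a scoring function such as Prosa, where lower is taken to be better.

```python
import math

def contact_penalty(coords, contacts, cutoff=8.0, weight=1.0):
    """Sum of linear penalties for predicted contacts farther apart than cutoff.

    coords:   list of (x, y, z) tuples, one per residue.
    contacts: iterable of (i, j) residue index pairs predicted to be in contact.
    """
    total = 0.0
    for i, j in contacts:
        distance = math.dist(coords[i], coords[j])
        total += weight * max(0.0, distance - cutoff)
    return total

def biased_energy(base_energy, coords, contacts):
    """Energy of a conformation after adding the contact-derived bias term."""
    return base_energy + contact_penalty(coords, contacts)

def select_top_models(decoys_mbs, decoys_rosetta, score_fn, n_top=5):
    """Pool decoys from both samplers and keep the n_top best-scoring ones."""
    pooled = list(decoys_mbs) + list(decoys_rosetta)
    return sorted(pooled, key=score_fn)[:n_top]
```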

Availability
We offer access to our protein structure and contact prediction methods through a webserver
under http://compbio.robotics.tu-berlin.de/rbo_aleph/

1. Mabrouk, M., Putz, I., Werner, T., Schneider, M., Neeb, M., Bartels, P., & Brock, O. (2015).
RBO Aleph: leveraging novel information sources for protein structure prediction. Nucleic
acids research, gkv357.
2. Söding, J., Biegert, A., & Lupas, A. N. (2005). The HHpred interactive server for protein
homology detection and structure prediction. Nucleic acids research, 33(suppl 2), W244-
W248.
3. Wu, S., & Zhang, Y. (2007). LOMETS: a local meta-threading-server for protein structure
prediction. Nucleic acids research, 35(10), 3375-3382.
4. Yang, Y., Faraggi, E., Zhao, H., & Zhou, Y. (2011). Improving protein fold recognition and
template-based modeling by employing probabilistic-based matching between predicted one-
dimensional structural properties of query and corresponding native properties of templates.
Bioinformatics, 27(15), 2076-2082.
5. Peng, J., & Xu, J. (2011). RaptorX: exploiting structure information for protein alignment by
statistical inference. Proteins: Structure, Function, and Bioinformatics, 79(S10), 161-171.
6. Sim, J., Kim, S. Y., & Lee, J. (2005). PPRODO: prediction of protein domain boundaries
using neural networks. Proteins: Structure, Function, and Bioinformatics, 59(3), 627-632.
7. Cheng, J., Sweredoski, M. J., & Baldi, P. (2006). DOMpro: protein domain prediction using
profiles, secondary structure, relative solvent accessibility, and recursive neural networks.
Data Mining and Knowledge Discovery, 13(1), 1-10.
8. Wiederstein, M., & Sippl, M. J. (2007). ProSA-web: interactive web service for the
recognition of errors in three-dimensional structures of proteins. Nucleic acids research,
35(suppl 2), W407-W410.
9. Tetchner, S., Kosciolek, T., & Jones, D. T. (2014). Opportunities and limitations in applying
coevolution-derived contacts to protein structure prediction. Bio-Algorithms and Med-
Systems, 10(4), 243-254.
10. Seemayer, S., Gruber, M., & Söding, J. (2014). CCMpred—fast and precise prediction of
protein residue–residue contacts from correlated mutations. Bioinformatics, 30(21), 3128-
3130.
11. Marks, D. S., Colwell, L. J., Sheridan, R., Hopf, T. A., Pagnani, A., Zecchina, R., & Sander,
C. (2011). Protein 3D structure computed from evolutionary sequence variation. PloS one,
6(12), e28766.
12. Kamisetty, H., Ovchinnikov, S., & Baker, D. (2013). Assessing the utility of coevolution-
based residue–residue contact predictions in a sequence-and structure-rich era. Proceedings
of the National Academy of Sciences, 110(39), 15674-15679.
13. Jones, D. T., Buchan, D. W., Cozzetto, D., & Pontil, M. (2012). PSICOV: precise structural
contact prediction using sparse inverse covariance estimation on large multiple sequence
alignments. Bioinformatics, 28(2), 184-190.

14. Schneider, M., & Brock, O. (2014). Combining physicochemical and evolutionary
information for protein contact prediction. PloS one, 9(10), e108438.
15. Brunette, T. J., & Brock, O. (2005). Improving protein structure prediction with model-based
search. Bioinformatics, 21(suppl 1), i66-i74.
16. Mabrouk, M., Werner, T., Schneider, M., Putz, I., & Brock, O. (2015). Analysis of free
modeling predictions by RBO aleph in CASP11. Proteins: Structure, Function, and
Bioinformatics.
17. Rohl, C. A., Strauss, C. E., Misura, K. M., & Baker, D. (2004). Protein structure prediction
using Rosetta. Methods in enzymology, 383, 66-93.

RBO-Epsilon

Contact Prediction by RBO-Epsilon in CASP12

Kolja Stahl1, Mahmoud Mabrouk1, Michael Schneider2 and Oliver Brock1


1 – Robotics and Biology Laboratory, Technische Universität Berlin, 2 - Chair of Bioanalytics, Institute of
Biotechnology, Technische Universität Berlin
oliver.brock@tu-berlin.de

To predict contacts with high accuracy, it is vital to leverage as much diverse information as
possible. RBO-Epsilon therefore combines evolutionary, sequence-based, and physicochemical
information. These sources of information are complementary. By combining them effectively,
we can compensate for the shortcomings of one type with the strengths of another. Our approach
is based on deep learning and utilizes stacking. Stacking treats the combination process
as a learning problem. With the help of indicator features, we learn to leverage the most effective
source of information. We also simplified the conventionally used feature set so that we can learn
on more data and increase model complexity. The model consists of 4 hidden layers with
400-200-200-50 neurons.
On 21 hard CASP11 FM targets, RBO-Epsilon achieves a mean precision of 36.1% for the
top L/10 long-range contacts, 8% higher relative to the current state-of-the-art MetaPSICOV. For the
top 1.5L contacts, the improvement is 16%.

Methods
RBO-Epsilon combines evolutionary, sequence-based, and physicochemical information. The
physicochemical information stems from EPC-map1, which ranked amongst the top contact
predictors in CASP11. We use the sequence-based feature set employed by MetaPSICOV2
stage 1, which includes the amino acid composition, secondary structure prediction, solvent
accessibility and column entropy, amongst other features. Building on the idea of PconsC 3 and
MetaPSICOV to include multiple sources of co-evolutionary information, we extend the feature
set to include the predictions of GaussDCA4 and GREMLIN5, in addition to CCMpred6,
FreeContact7 and PSICOV8.
The feature set is initially high-dimensional, with 672 features. Using a feature importance
analysis, the dimensionality (including the newly added features) is reduced to 171, enabling the
use of a more complex neural network. The feature importance is computed by XGBoost 9
(eXtreme gradient boosting), a decision tree-based approach. XGBoost partitions the dataset
based on the features that best separate the classes (here contacts and non-contacts). Features that
are higher up in the tree are deemed more important. The final feature set is a mix of high-level
features (EPC-map prediction, co-evolutionary information) and crude sequence-based features,
which may also act as indicator variables for the high-level features. It can therefore be
seen as a variant of stacking.
The final model is a neural network with 4 hidden layers and Maxout10 activation, and
is trained on 28 million samples (contacts / non-contacts). For regularization we used Dropout11.
The network is implemented in Keras12.
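
To illustrate what a Maxout network of this size computes, the following is a plain-Python forward pass with 171 input features and hidden layers of 400, 200, 200 and 50 units. It is a sketch, not the Keras model: the number of linear pieces per Maxout unit, the dropout rate, the random weights and the sigmoid output unit are all assumptions.

```python
import math
import random

LAYER_SIZES = [171, 400, 200, 200, 50]  # input features, then the 4 hidden layers
PIECES = 2          # linear pieces per Maxout unit (assumed)
DROPOUT_RATE = 0.5  # assumed

def init_layer(n_in, n_out, pieces, rng):
    """Random weights[piece][unit][input] and zero biases[piece][unit]."""
    weights = [[[rng.gauss(0.0, 0.05) for _ in range(n_in)]
                for _ in range(n_out)] for _ in range(pieces)]
    biases = [[0.0] * n_out for _ in range(pieces)]
    return weights, biases

def maxout(x, layer):
    """Each unit outputs the maximum over its linear pieces."""
    weights, biases = layer
    return [max(sum(w * xi for w, xi in zip(weights[k][j], x)) + biases[k][j]
                for k in range(len(weights)))
            for j in range(len(biases[0]))]

def dropout(x, rate, rng, training):
    """Inverted dropout: active only during training, identity otherwise."""
    if not training:
        return x
    keep = 1.0 - rate
    return [xi / keep if rng.random() < keep else 0.0 for xi in x]

def forward(features, layers, out_weights, out_bias, rng, training=False):
    """Contact probability for one residue pair described by `features`."""
    hidden = features
    for layer in layers:
        hidden = dropout(maxout(hidden, layer), DROPOUT_RATE, rng, training)
    z = sum(w * h for w, h in zip(out_weights, hidden)) + out_bias
    return 1.0 / (1.0 + math.exp(-z))

rng = random.Random(0)
layers = [init_layer(n_in, n_out, PIECES, rng)
          for n_in, n_out in zip(LAYER_SIZES, LAYER_SIZES[1:])]
out_weights = [rng.gauss(0.0, 0.05) for _ in range(LAYER_SIZES[-1])]
example = [rng.random() for _ in range(LAYER_SIZES[0])]
prob = forward(example, layers, out_weights, 0.0, rng)
```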

Availability
RBO-Epsilon is available as part of the RBO Aleph web server at
http://compbio.robotics.tu-berlin.de/rbo_aleph/, where it replaces EPC-map as the contact predictor.
The code is available from the author on request.

1. Schneider, Michael, and Oliver Brock. (2014) Combining physicochemical and evolutionary
information for protein contact prediction. PloS one. 9(10). e108438.
2. Jones, David T., et al. (2015). MetaPSICOV: combining coevolution methods for accurate
prediction of contacts and long range hydrogen bonding in proteins. Bioinformatics 31.7:
999-1006.
3. Skwark, Marcin J., Abbi Abdel-Rehim, and Arne Elofsson. (2013). PconsC: combination of
direct information methods and alignments improves contact prediction. Bioinformatics
29.14: 1815-1816.
4. Carlo Baldassi, Marco Zamparo, Christoph Feinauer, Andrea Procaccini, Riccardo Zecchina,
Martin Weigt and Andrea Pagnani. (2014) Fast and accurate multivariate Gaussian modeling
of protein families: Predicting residue contacts and protein-interaction partners. PLoS ONE
9(3): e92721
5. Kamisetty, Hetunandan, Sergey Ovchinnikov, and David Baker. (2013) Assessing the utility
of coevolution-based residue–residue contact predictions in a sequence-and structure-rich
era. Proceedings of the National Academy of Sciences 110.39: 15674-15679.
6. Seemayer, Stefan, Markus Gruber, and Johannes Söding. (2014) CCMpred—fast and precise
prediction of protein residue–residue contacts from correlated mutations. Bioinformatics
30.21 : 3128-3130.
7. Kaján, László, et al. (2014) FreeContact: fast and free software for protein contact prediction
from residue co-evolution. BMC bioinformatics 15.1: 1.
8. Jones, David T., et al. (2012) PSICOV: precise structural contact prediction using sparse
inverse covariance estimation on large multiple sequence alignments. Bioinformatics 28.2:
184-190.
9. Chen, Tianqi, and Tong He. (2015) xgboost: eXtreme Gradient Boosting. R package version
0.4-2.
10. Goodfellow, Ian J., et al. (2013) Maxout networks. ICML (3) 28: 1319-1327.
11. Srivastava, Nitish, et al. (2014). Dropout: a simple way to prevent neural networks from
overfitting. Journal of Machine Learning Research 15.1: 1929-1958.
12. François Chollet. (2015) Keras. https://github.com/fchollet/keras

Rosetta_at_Kingston

Optimizing Rosetta according to a target’s structural class

Jad Abbass and Jean-Christophe Nebel


Kingston University, London, UK
K1064285@kingston.ac.uk

We have contributed models for 29 targets using a customized version of Rosetta so that it is
optimized according to the structural class a target is predicted to belong to.

Methods
Although our methodology relies mainly on the Rosetta ab initio protein structure predictor,
major amendments were made to both the fragment library and the diversity of the 3-mers used
during the conformation refinement phase. Our new approach1 has demonstrated significantly
improved performance, especially for targets rich in alpha helices.
For each of the main structural classes defined by SCOP, i.e. all alpha, all beta,
alpha+beta and alpha/beta, we have created a specific structure library from which fragments are
extracted. For each structural class, we have extracted the proteins annotated as belonging to that
SCOP class to create a customized 'vall', the name used by Rosetta for the file containing
the set of PDB structures/domains that serve as the fragment resource.
For a given target, first, its structural class is predicted from its sequence. Second, Rosetta
predicts its structure using the fragments extracted from the corresponding 'vall'. Third, to limit
the diversity of the fragments and allow better exploration of local neighborhoods when dealing
with the 'easier targets', i.e. alpha, alpha+beta and alpha/beta, we have decreased the number of
3-mers from 200 to 25.
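
The class-dependent setup described above amounts to a small lookup, sketched here. The file names are hypothetical placeholders, and keeping 200 3-mers for all-beta targets is inferred from the text rather than stated explicitly.

```python
# Hypothetical file names for the class-specific fragment libraries ('vall' files).
VALL_BY_CLASS = {
    "all-alpha": "vall.all_alpha",
    "all-beta": "vall.all_beta",
    "alpha+beta": "vall.alpha_plus_beta",
    "alpha/beta": "vall.alpha_slash_beta",
}

DEFAULT_3MERS = 200
REDUCED_3MERS = 25
EASIER_CLASSES = {"all-alpha", "alpha+beta", "alpha/beta"}

def fragment_settings(predicted_class):
    """Return the 'vall' file and number of 3-mers for a predicted SCOP class."""
    if predicted_class not in VALL_BY_CLASS:
        raise ValueError(f"unknown structural class: {predicted_class}")
    n_3mers = REDUCED_3MERS if predicted_class in EASIER_CLASSES else DEFAULT_3MERS
    return {"vall": VALL_BY_CLASS[predicted_class], "n_3mers": n_3mers}
```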

1. Abbass, J. & Nebel, J.-C. (2015). Customised fragments libraries for protein structure
prediction based on structural class annotations. BMC Bioinformatics. 16, 136.

RRCPred

Residue-Residue Contact Prediction

Piyush Agrawal, Gandharva Nagpal, Deepti Sethi, Sandeep Singh and Gajendra P.S. Raghava*
Bioinformatics Centre, CSIR-Institute of Microbial Technology, Chandigarh (160036), India
*raghava@imtech.res.in

Residue-residue contact maps of a protein help in determining its 3D structure, model ranking,
functional annotation and structure-based drug design [1]. Although a number of
methods are available for predicting interacting residue pairs, the field is still emerging. In this
study we have developed a method named RRCPred to predict the interacting residues using
binary profiles of the residues.

Methods
We extracted information on the residues in contact from the structures of CASP8,
CASP9 and CASP10. Two residues were considered to be in contact if the distance between their Cβ
atoms (Cα atoms in the case of glycine) is less than 8 Å and they are separated by
at least 6 residues in sequence (CASP criteria) [2]. We divided our data into the following four sets
based on the sequence separation between the contacting residues: i) 7-10 residues (group1),
ii) 11-20 residues (group2), iii) 21-30 residues (group3) and iv) more than 30 residues (group4).
We generated binary profiles of all the contacting residues with a window length of 9 and developed
models using the Logistic classifier of WEKA [3] for all four groups. We evaluated the performance
of these models on a blind dataset generated from CASP11 targets.
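
The contact definition, the grouping by sequence separation and a binary profile can be sketched as follows. The one-hot encoding with an extra padding position for window slots that fall outside the sequence, and the concatenation of the two residue windows into one pair vector, are assumptions; the abstract does not describe the exact encoding.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def in_contact(distance, separation):
    """Contact criterion from the text: below 8 Angstrom, at least 6 residues apart."""
    return separation >= 6 and distance < 8.0

def separation_group(separation):
    """Map a sequence separation to group 1-4, or None if below 7 residues."""
    if 7 <= separation <= 10:
        return 1
    if 11 <= separation <= 20:
        return 2
    if 21 <= separation <= 30:
        return 3
    if separation > 30:
        return 4
    return None

def binary_profile(sequence, center, window=9):
    """One-hot encode a window around `center` (20 amino acids plus padding)."""
    half = window // 2
    vector = []
    for pos in range(center - half, center + half + 1):
        one_hot = [0] * (len(AMINO_ACIDS) + 1)
        if 0 <= pos < len(sequence) and sequence[pos] in AMINO_ACIDS:
            one_hot[AMINO_ACIDS.index(sequence[pos])] = 1
        else:
            one_hot[-1] = 1  # padding or non-standard residue
        vector.extend(one_hot)
    return vector

def pair_features(sequence, i, j, window=9):
    """Feature vector for a residue pair: both binary profiles concatenated."""
    return binary_profile(sequence, i, window) + binary_profile(sequence, j, window)
```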

Results
We measured the performance of these models in terms of accuracy and Matthews Correlation
Coefficient (MCC). For the prediction of interacting residues, we obtained an accuracy of 74.13%
and an MCC of 0.48 for group1 (i.e. short-range contacts) and an accuracy of 64.74% and an MCC of
0.29 for group2. Similarly, for group3 we obtained an accuracy of 62.65% and an MCC of 0.25, and
for group4 (i.e. long-range contacts) we obtained a maximum accuracy of 58.20% and an MCC
of 0.16.
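
For reference, both reported measures follow from the confusion matrix of a binary classifier. The sketch below uses the standard definitions; the function names are illustrative.

```python
import math

def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)

def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews Correlation Coefficient; 0.0 when the denominator vanishes."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator
```

Unlike accuracy, the MCC stays informative when contacts and non-contacts are unevenly represented, which is why the two measures are usually reported together.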

Availability

RRCPred is implemented as a web server, which can be accessed at
http://crdd.osdd.net/raghava/rrcpred/.

1. Adhikari B and Cheng J. Protein residue contacts and prediction methods. Methods Mol
Biol. 2016; 1415: 463–476.
2. Monastyrskyy B, Fidelis K, Tramontano A, Kryshtafovych A. Evaluation of residue-residue
contact predictions in CASP9. Proteins, 2011; 79 (Suppl 10): 119–125.
3. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P. & Witten, I. H. The WEKA Data
Mining Software: An Update. SIGKDD Explorations 11, 10–18 (2009).

Seok-assembly, Seok (Assembly)

Prediction of Protein Complex Structures by GALAXY in CASP12

Minkyung Baek, Taeyong Park, Jinsol Yang, Sangwoo Park, Beomchang Kang, Hwan Won
Chung, Jonghun Won, Lim Heo and Chaok Seok
Department of Chemistry, Seoul National University, Seoul 151-747, Republic of Korea
chaok@snu.ac.kr

The assembly server Seok-assembly made fully automated structure predictions for homo-
oligomer and hetero-oligomer targets in CASP12. The human group Seok submitted models
generated by using GALAXY with human intervention.

Methods

The overall pipeline for Seok-assembly is presented in Figure 1. For homo-oligomer targets with
unknown oligomeric states, oligomeric states were predicted by a consensus method using the
HHsearch1 results. From the HHsearch results and the given/predicted oligomeric state, target
difficulty was estimated in terms of predicted MM-score. A template-based approach was applied
to easier targets with predicted MM-score > 0.4, and an ab initio approach was applied to harder
targets. In the template-based approach, up to five template proteins were selected among the
proteins detected by HHsearch after re-ranking them based on the fold recognition scores and
similarity among templates. Oligomer models were built from each selected template using
GalaxyCassiopeia, the model-building component of GalaxyTBM2. Unreliably modeled regions
(ULRs) were detected and re-modeled by using GalaxyLoop3,4 with symmetry restraints. In the
ab initio approach, monomer structures were predicted by using GalaxyTBM. After removing
ULRs in the monomer structures to prevent interference with homo-oligomer structure
formation, homo-oligomer structures were predicted by using an in-house program called
OligoTongDock. OligoTongDock performs FFT-based low-resolution protein-protein docking
considering oligomer symmetry. Five models were selected after clustering models generated by
OligoTongDock, and the deleted ULRs were re-built on them. The resulting models were further
refined using GalaxyRefineComplex5 with symmetry restraints. For hetero-oligomer targets,
complex structures were predicted using the ab initio protein-protein docking program
GalaxyPPDock with subunit structures predicted by the Seok TS protocol. The human
predictions followed the overall pipeline, except that template selection and final model selection
were performed manually.
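
The branching logic of the pipeline can be summarized in a short sketch. The step descriptions paraphrase the text, and the function and argument names are assumptions. The text does not say whether template-based models also pass through GalaxyRefineComplex, so that step is listed only in the ab initio branch here.

```python
MM_SCORE_THRESHOLD = 0.4  # predicted MM-score above which a target counts as easier

def plan_assembly(is_homo_oligomer, predicted_mm_score=None):
    """Return the ordered modeling steps for one assembly target."""
    if not is_homo_oligomer:
        return [
            "predict subunit structures (Seok TS protocol)",
            "dock subunits ab initio (GalaxyPPDock)",
        ]
    if predicted_mm_score is not None and predicted_mm_score > MM_SCORE_THRESHOLD:
        return [
            "select up to five templates from re-ranked HHsearch hits",
            "build oligomer models (GalaxyCassiopeia)",
            "re-model ULRs with symmetry restraints (GalaxyLoop)",
        ]
    return [
        "predict monomer structure (GalaxyTBM)",
        "remove ULRs from the monomer",
        "dock with oligomer symmetry (OligoTongDock)",
        "cluster models, select five and re-build ULRs",
        "refine with symmetry restraints (GalaxyRefineComplex)",
    ]
```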

Figure 1. Seok-assembly pipeline tested in CASP12

Results
The Seok-assembly pipeline was tested on a benchmark set of 20 CASP11/CAPRI30 homo-oligomer
structure prediction targets. In terms of the CAPRI evaluation measures, 10 and 2 targets were
predicted with medium and acceptable quality, respectively. For comparison, HADDOCK, the
best performing server in CASP11/CAPRI30, predicted 9 targets with medium quality and 6
targets with acceptable quality.

Availability
The GALAXY programs used in Seok-assembly are available on the GalaxyWEB web page at
http://galaxy.seoklab.org.

1. Söding,J. (2005). Protein homology detection by HMM-HMM comparison. Bioinformatics
21, 951-60.
2. Ko,J., Park,H. & Seok,C. (2012). GalaxyTBM: template-based modeling by building a
reliable core and refining unreliable local regions. BMC Bioinformatics 13, 198.
3. Park,H. & Seok,C. (2012). Refinement of unreliable local regions in template-based protein
models. Proteins 80, 1974-86.
4. Park,H., Lee,G.R., Heo,L. & Seok,C. (2014). Protein loop modeling using a new hybrid
energy function and its application to modeling in inaccurate structural environments, PLoS
ONE 9, e113811.
5. Heo,L., Lee,H. & Seok,C. (2016). GalaxyRefineComplex: Refinement of protein-protein
complex model structures driven by interface repacking, Scientific Reports 6, 32153.

Seok-server, Seok (Refinement)

Automatic protein structure refinement with an improved energy function and diverse
sampling of unreliable regions

Gyu Rie Lee, Lim Heo and Chaok Seok


Department of Chemistry, Seoul National University, Seoul 08826, Republic of Korea
chaok@snu.ac.kr

Current protein structure refinement methods struggle to improve structures consistently
at the backbone level while remaining computationally efficient. In the CASP11 refinement
experiment, we introduced global structural changes by normal mode analysis and secondary
structure perturbation1,2, but restraints used to avoid drift-away from the initial structure
impeded large conformational changes. In CASP12, we adopted an improved energy function
and applied fewer restraints. In addition, the detection and modeling methods for unreliable local
regions (ULRs) that were successfully tested in CASP11 by the human group were automated to be
part of the CASP12 protocol. The new refinement server method showed improvement in both
global and local evaluation measures. For the human prediction Seok, the same refinement
pipeline was applied except that ULR detection, modeling, and selection were performed
manually by using loop modeling3.

Methods
The newly developed refinement server performed up to ten iterations of successively deeper
sampling of the low-energy regions of the conformational space. This server method is outlined
in Figure 1. After each iteration step, 35 conformations with the lowest energies that are distant
from each other were selected as seed conformations for the next iteration. Low-energy
conformations were generated by diverse sampling methods. A single-model, residue-level
quality assessment method was applied to the given initial structure to detect unreliable regions,
and fragment assembly and loop sampling methods were applied to these regions4. The
experimentally resolved structures of homologous proteins were also used to detect locally
variable regions and to generate hybrid structures. Global structural changes were introduced by
using normal mode analysis and secondary structure perturbation. All new conformations were
subjected to 3 or 1.2 ps of molecular dynamics relaxation, depending on the magnitude of the
introduced structural changes.
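The seed-selection step described above (keeping 35 low-energy conformations that are distant from each other) can be illustrated with a minimal greedy sketch. The function name, the `min_distance` cutoff and the generic `distance` callback are assumptions made for illustration; the abstract does not state how structural distance was measured or which cutoff was used.

```python
from typing import Callable, List, Sequence, TypeVar

C = TypeVar("C")


def select_diverse_seeds(
    conformations: Sequence[C],
    energies: Sequence[float],
    distance: Callable[[C, C], float],
    n_seeds: int = 35,            # number of seeds stated in the abstract
    min_distance: float = 1.0,    # hypothetical cutoff, not from the abstract
) -> List[C]:
    """Greedily keep the lowest-energy conformations that are mutually distant."""
    # Visit conformations from lowest to highest energy.
    order = sorted(range(len(conformations)), key=lambda i: energies[i])
    seeds: List[C] = []
    for i in order:
        candidate = conformations[i]
        # Accept only if far enough from every seed accepted so far.
        if all(distance(candidate, s) >= min_distance for s in seeds):
            seeds.append(candidate)
        if len(seeds) == n_seeds:
            break
    return seeds


# Toy usage: 1-D "conformations" with absolute difference as the distance.
toy_conformations = [0.0, 0.1, 2.0, 2.05, 5.0]
toy_energies = [-5.0, -4.0, -3.0, -6.0, -1.0]
toy_seeds = select_diverse_seeds(
    toy_conformations, toy_energies, lambda a, b: abs(a - b), n_seeds=3
)
```

In practice the distance would be a structural measure such as RMSD between models, and the cutoff would need tuning so that 35 seeds can actually be found.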
An improved hybrid energy function that includes a new knowledge-based potential was
employed for the refinement. The new knowledge-based potential considers dependence of the
atom pair potential on the solvation states of the interacting atoms5 as well as their pair distance.
This hybrid energy function was optimized for both relaxing and scoring conformations. To
allow for more conformational change during relaxation, only 90% of the harmonic restraints
derived from the initial structure were applied, and the restraints to apply were selected by
Bayesian inference at every time step6.
After iteration, all conformations were collected and scored with the energy function without
restraints. The five lowest-energy models were selected after reducing redundancy.

Results
A benchmark test of the new refinement method was performed on a set consisting of 62
previous CASP10 and 11 refinement targets. The quality of the refined models was
measured as improvements in GDT-HA, GDC-SC, lDDT, and MolProbity. Target-averaged
improvements of model 1 structures were 0.98, 2.42, 0.018, and 1.24, respectively. This is
improved or comparable performance when compared to the CASP11 refinement server method
which showed average improvements of 0.65, 2.14, 0.012, and 1.25, respectively. As with the
CASP11 refinement method, the new refinement method could improve structures consistently.
For each evaluation measure, 69.4%, 74.2%, 80.6%, and 96.8% of the targets could be improved
using the new refinement method, respectively.

Figure 1. Flowchart of the Seok-server refinement method used in CASP12.

Availability
A web service for the simplified version (CASP10 version) of the method is available at
http://galaxy.seoklab.org/refine. The loop modeling method is also publicly available at
http://galaxy.seoklab.org/loop. We plan to release the current refinement method as a
web server.

1. Lee,G.R., Heo,L. & Seok,C. (2015) Effective protein model structure refinement by loop
modeling and overall relaxation. Proteins in press.
2. Heo,L., Park,H. & Seok,C. (2013) GalaxyRefine: Protein structure refinement driven by side-
chain repacking. Nucleic Acids Res. 41 (W1), W384-W388.
3. Park,H., Lee,G.R., Heo,L. & Seok,C. (2014) Protein loop modeling using a new hybrid
energy function and its application to modeling in inaccurate structural environments. PLoS
ONE 9 (11): e113811.
4. Lee,J., Lee,D., Park,H., Coutsias,E.A. & Seok,C. (2010) Protein loop modeling by using
fragment assembly and analytical loop closure. Proteins 78, 3428-3436.
5. Heo,L. & Seok,C. A new statistical potential with consideration of solvation effects for
protein simulations. in preparation.
6. MacCallum,J.L., Perez,A. & Dill,K.A. (2015) Determining protein structures by combining
semireliable data with atomistic physical models by Bayesian inference. Proc Natl Acad Sci
U S A 112 (22), 6985–6990.

Seok-server (TS, QA), Seok-refine (TS)

GALAXY in CASP12: Fully automated protein structure prediction, model quality
assessment, and refinement

Lim Heo, Gyu Rie Lee, Minkyung Baek, Taeyong Park, Jonghun Won, Beomchang Kang, Jinsol
Yang, Hwan Won Chung, Sangwoo Park, and Chaok Seok
Department of Chemistry, Seoul National University, Seoul 08826, Republic of Korea
chaok@snu.ac.kr

Seok-server performed fully automated protein tertiary structure predictions1 for TS targets.
Seok-server also submitted predictions for QA targets using GalaxyQA2, an energy-based, non-
consensus QA method. As a meta-server, Seok-refine submitted predictions for TS targets by
refining models selected by using GalaxyQA. A simplified version of GalaxyRefine3,4 was used
for the refinement.

Methods
Seok-server predictions for TS targets with GalaxyTBM
The overall pipeline for protein tertiary structure prediction consists of five steps: (1) domain
detection and template search, (2) multiple sequence alignment, (3) tertiary structure building,
(4) ULR (unreliable local region) detection and refinement, and (5) overall structure
refinement. For a given target sequence, domains are detected by GalaxyDom5 which utilizes
HHsearch6 against the SCOP70 and PDB70 databases. For each domain, HHsearch is used to
search for templates from PDB70 and the search results are re-ranked to select templates
depending on the target difficulty estimated by a linear regression method. Tertiary structures are
built by GalaxyCassiopeia using multiple sequence alignment generated by Promals3D7. In this
step, 24 models are constructed by short VTFM MD simulations with template-driven restraints
and CHARMM22 force field. A residue-wise quality assessment method is applied to detect
ULRs, where a linear regression model using RMSF of constructed models, multiple sequence
alignment, and fragment library is used. Long loop ULRs (> 15 residues) and terminal ULRs are
constructed by FALC8 and short MD relaxations, and short loop ULRs (<= 15 residues) are built
by GalaxyLoop9. The full structures are finally refined by repetitive side chain repacking and
short MD relaxations.
Seok-server predictions for QA targets with GalaxyQA
For QA targets, an energy-based, non-consensus model quality assessment method called
GalaxyQA2 was applied to predict model accuracy. For each target, all the given server models
were locally minimized in the GalaxyRefine3,4 energy. The energy-minimized models were
ranked based on a new knowledge-based potential called KGB2, which depends on solvation
states of interacting atoms as well as their distances and orientations. Solvation states of atoms
are described by the Born radii of FACTS10, a generalized Born solvation model. With this
approach, previously unseen interaction features could be captured especially for charged and
polar interactions. Model quality scores are assigned between 0 and 1. For the top-ranked model,
the estimated target difficulty for the TS target was assigned. For the worst model, TM-score to
the top-ranked model multiplied by the estimated target difficulty was assigned. For the
remaining models, linearly scaled scores between them are assigned based on the KGB energy.
This method was benchmarked on CASP10/11 single domain TBM targets (with the best GDT-
TS > 40). Average loss from the best model in GDT-TS was 8.44, compared to 9.00, 9.91, 13.54,
and 15.43 for GOAP, SOAP, dDFIRE, and RWplus, respectively. The Pearson’s correlation
coefficient between the predicted and observed accuracy in GDT-TS was 0.627, compared to
0.488, 0.270, 0.312, and 0.324 for the other four methods.
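The score assignment described in this subsection (top-ranked model receives the estimated target difficulty, the worst model receives its TM-score to the top model multiplied by that difficulty, and the rest are scaled linearly by KGB energy) can be written out as a short sketch. It assumes that a lower KGB energy means a better rank and that the estimated difficulty is already expressed on a 0 to 1 accuracy scale; the function and argument names are hypothetical.

```python
from typing import List, Sequence


def assign_quality_scores(
    kgb_energies: Sequence[float],
    estimated_difficulty: float,
    tm_worst_to_top: float,
) -> List[float]:
    """Map KGB energies to scores in [0, 1] by linear interpolation.

    Assumptions (not stated in the abstract): lower energy is better, and
    estimated_difficulty is a 0-1 value usable directly as an accuracy score.
    """
    e_best = min(kgb_energies)
    e_worst = max(kgb_energies)
    s_best = estimated_difficulty                     # score of the top-ranked model
    s_worst = tm_worst_to_top * estimated_difficulty  # score of the worst model
    if e_worst == e_best:
        # Degenerate case: all models have the same energy.
        return [s_best for _ in kgb_energies]
    span = e_worst - e_best
    return [
        s_best + (e - e_best) / span * (s_worst - s_best)
        for e in kgb_energies
    ]


# Toy usage with three models.
toy_scores = assign_quality_scores([-100.0, -80.0, -60.0], 0.8, 0.5)
```

With these toy inputs the three models receive scores of about 0.8, 0.6 and 0.4, so the ranking by energy is preserved while the absolute scale is tied to the difficulty estimate.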
Seok-refine predictions for TS targets with GalaxyQA and GalaxyRefine
Seok-refine is a meta-server which refines the server models selected by a quality assessment
method. All server models were first scored using Pcons11, a consensus-based method, and top
48 models were then ranked using GalaxyQA. The model with the highest QA score was refined
using the most recent version of GalaxyRefine but with reduced computation. More specifically,
the maximum number of conformations was reduced to 900, and top 24 models after QA were
used instead of homologous protein structures for hybridization. Five lowest-energy models were
finally submitted. This method was benchmarked on 103 CASP11 targets. Target-averaged
quality of the top-ranked server models was 38.1/20.3 in GDT-HA/GDC-SC. These models were
improved by refinement to 38.9/21.1. The average structure quality of the best-performing server
was 37.7 and 17.5, respectively.

Availability
GalaxyTBM and GalaxyRefine are available on GalaxyWEB (http://galaxy.seoklab.org).
GalaxyRefine is also available as a standalone version (http://seoklab.github.io/GalaxyRefine).
The KGB energy function and GalaxyQA method will be made available soon.

1. Ko,J., Park,H., & Seok,C. (2012) GalaxyTBM: template-based modeling by building a
reliable core and refining unreliable local regions. BMC Bioinformatics 13, 198.
2. Heo,L. & Seok,C. A new statistical potential with consideration of solvation effects for
protein simulations. in preparation.
3. Lee,G.R., Heo,L. & Seok,C. (2015) Effective protein model structure refinement by loop
modeling and overall relaxation. Proteins in press.
4. Heo,L., Park,H. & Seok,C. (2013) GalaxyRefine: Protein structure refinement driven by side-
chain repacking. Nucleic Acids Res. 41 (W1), W384-W388.
5. Choe,K., Heo,L, Ko,J. & Seok,C. GalaxyDom: a method to detect modeling units for protein
structure prediction. in preparation.
6. Söding,J. (2005) Protein homology detection by HMM-HMM comparison. Bioinformatics 21
(7) 951-960.
7. Pei,J., Kim,B. & Grishin,N. (2008) PROMALS3D: a tool for multiple sequence and structure
alignment. Nucleic Acids Res. 36 (7), 2295-2300.
8. Lee,J., Lee,D., Park,H., Coutsias,E.A. & Seok,C. (2010) Protein loop modeling by using
fragment assembly and analytical loop closure. Proteins 78, 3428-3436.
9. Park,H., Lee,G.R., Heo,L. & Seok,C. (2014) Protein loop modeling using a new hybrid
energy function and its application to modeling in inaccurate structural environments. PLoS
ONE 9 (11): e113811.
10. Haberthur,U. & Caflisch,A. (2008) FACTS: Fast analytical continuum treatment of solvation.
J. Comput. Chem. 29 (5), 701-715.

Shen-Group

R2C 2.0: Ab initio residue contact map prediction using dynamic fusion strategy and
Gaussian noise filter

Jing Yang1,2 and Hong-Bin Shen1,2


1 Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, 2 Key Laboratory of
System Control and Information Processing, Ministry of Education of China, Shanghai 200240, China
hbshen@sjtu.edu.cn

Residue-residue contacts are crucial for protein folding and structural stability. Inter-residue
contacts in proteins also dictate the topology of protein structures. In the literature, there are
generally two types of methods that predict residue contact maps solely from the primary
protein sequence: machine learning (ML)-based and correlated mutation analysis (CMA)-based
approaches. Here, we present an updated 2.0 version of R2C1 that predicts residue contacts
by fusing multiple base predictors, both ML-based and CMA-based, to make full use of their
diversity. As we showed for R2C1, the outputs of the ML-based method are concentrated, with
better performance on short-range contacts, whereas the predictions of the CMA-based approach
are widespread, with higher accuracy on long-range contacts. We also observed that the
performance of CMA-based approaches depends significantly on the effective size of the multiple
sequence alignment (MSA). Thus, a query-driven dynamic fusion strategy based on the effective
MSA size was used to take full advantage of the two methods, resulting in an overall accuracy
improvement over any single type of predictor. We also found that the contact map obtained
directly from the CMA-based prediction model contains Gaussian noise. Therefore, we performed
noise reduction to optimize the CMA-based predictions before the dynamic fusion with the
ML-based predictions.

Methods

The newly developed R2C 2.0 predictor is composed of two engines: ML-based and CMA-based
prediction modules. The ML-based module is implemented in an ensemble classifier framework
that fuses five support vector machine (SVM) classifiers for each range (short,
medium, and long). Compared with our original version of R2C, we used more sequence-derived
features as SVM inputs to broaden the coverage of sequence features, such as amino acid
composition, conservation and correlated mutations. The CMA-based engine relies on mfDCA2,
PSICOV3 and GREMLIN4 to detect direct couplings from the multiple sequence alignment (MSA),
which was generated by using HHblits5 to search against the bundled UniProt20 database. Our
local results show that using three different CMA-based algorithms reduces the
calculation bias of any single approach and performs better than the initial R2C version, which
used only PSICOV in the CMA-based branch6. The outputs of these three CMA-based approaches were
linearly combined, and a Gaussian noise filter was applied to remove the Gaussian noise present in
the raw contact map obtained from CMA-based predictions1. The two modules were then
merged, with fusion weights that depend on the MSA size of the query
sequence, since the MSA size directly affects the performance of CMA-based predictions. This
query-driven dynamic fusion strategy performed better than the other
fusion approaches we tested, such as simple averaging.
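The three operations named above (linear combination of the CMA maps, Gaussian filtering, and MSA-size-dependent fusion with the ML map) can be sketched in a few lines. This is only an illustration under stated assumptions: the abstract gives neither the combination weights, nor the filter parameters, nor the functional form of the fusion weight, so `cma_weight`, its `midpoint` parameter, and the kernel `sigma` and `radius` below are all hypothetical.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]


def linear_combine(maps: Sequence[Matrix], weights: Sequence[float]) -> Matrix:
    """Weighted average of several L x L contact maps (weights are assumed)."""
    n = len(maps[0])
    total = sum(weights)
    return [
        [sum(w * m[i][j] for w, m in zip(weights, maps)) / total for j in range(n)]
        for i in range(n)
    ]


def gaussian_smooth(m: Matrix, sigma: float = 1.0, radius: int = 2) -> Matrix:
    """Smooth a map with a 2-D Gaussian kernel, renormalized at the edges."""
    n = len(m)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = 0.0
            norm = 0.0
            for di in range(-radius, radius + 1):
                for dj in range(-radius, radius + 1):
                    a, b = i + di, j + dj
                    if 0 <= a < n and 0 <= b < n:
                        k = math.exp(-(di * di + dj * dj) / (2.0 * sigma * sigma))
                        acc += k * m[a][b]
                        norm += k
            out[i][j] = acc / norm
    return out


def cma_weight(n_eff: float, midpoint: float = 1000.0) -> float:
    """Hypothetical saturating weight: larger effective MSA, more trust in CMA."""
    return n_eff / (n_eff + midpoint)


def dynamic_fusion(ml_map: Matrix, cma_map: Matrix, n_eff: float) -> Matrix:
    """Blend ML and CMA maps with a query-specific weight."""
    w = cma_weight(n_eff)
    n = len(ml_map)
    return [
        [(1.0 - w) * ml_map[i][j] + w * cma_map[i][j] for j in range(n)]
        for i in range(n)
    ]
```

A query with few homologous sequences would thus lean on the ML map, while a query with a deep alignment would lean on the filtered CMA map.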

Results

We compared R2C 2.0 with our previous R2C 1.0 locally on the 30 CASP11 free
modeling (FM) targets. The original R2C 1.0 achieved a mean accuracy of 22.8% for
the top L/5 long-range contacts, whereas the updated R2C 2.0 reached a mean
accuracy of 32.3%. Table 1 shows the
detailed results of top L/5 long-range contact prediction.
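For readers unfamiliar with the metric, the top L/5 long-range accuracy can be computed as in the sketch below. The abstract does not define "long-range"; the default sequence separation of at least 24 residues follows the usual CASP convention and is an assumption here, as are the function name and the data layout.

```python
from typing import Dict, Set, Tuple

Pair = Tuple[int, int]


def top_l5_long_range_accuracy(
    predicted: Dict[Pair, float],
    native_contacts: Set[Pair],
    seq_len: int,
    min_separation: int = 24,  # assumed long-range definition (CASP convention)
) -> float:
    """Percentage of the top L/5 long-range predictions that are native contacts.

    predicted maps residue pairs (i, j) with i < j to a contact score.
    """
    long_range = [
        (score, pair)
        for pair, score in predicted.items()
        if pair[1] - pair[0] >= min_separation
    ]
    # Highest-scoring predictions first.
    long_range.sort(key=lambda item: item[0], reverse=True)
    k = max(1, seq_len // 5)
    top = [pair for _, pair in long_range[:k]]
    if not top:
        return 0.0
    hits = sum(1 for pair in top if pair in native_contacts)
    return 100.0 * hits / len(top)


# Toy usage: a 10-residue chain with a reduced separation cutoff.
toy_pred = {(0, 5): 0.9, (1, 8): 0.8, (2, 3): 0.99, (0, 9): 0.1}
toy_native = {(0, 5), (0, 9)}
toy_acc = top_l5_long_range_accuracy(toy_pred, toy_native, seq_len=10, min_separation=3)
```

In the toy case two long-range predictions are evaluated and one of them is a native contact.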

Table 1. Top L/5 long-range contact prediction performance comparison of the updated 2.0 and original
1.0 versions of R2C on CASP11 30 FM targets.
Target      Acc1 a   Acc2 b     Target      Acc1 a   Acc2 b
T0761-D1       5.6     16.7     T0806-D1      64.7     60.8
T0761-D2       8.7      8.7     T0808-D2      40.7     37.0
T0763-D1      15.4     15.4     T0810-D1      30.4     26.1
T0767-D2      30.6     50.0     T0814-D1      55.6     44.4
T0771-D1       0.0     16.7     T0814-D2      34.8     56.5
T0777-D1       2.9     14.5     T0820-D1       5.6      5.6
T0781-D1      12.5     20.0     T0824-D1      45.5     50.0
T0785-D1       9.1     13.6     T0827-D2       6.7     40.0
T0789-D1      48.3     44.8     T0831-D2       2.6     23.1
T0789-D2      16.0     44.0     T0832-D1      11.9      2.4
T0790-D1      29.6     37.0     T0834-D1       0.0      0.0
T0790-D2      15.4     50.0     T0834-D2       5.9     23.5
T0791-D1      56.7     76.7     T0836-D1       7.3     51.2
T0791-D2      28.6     32.1     T0837-D1      20.8     29.2
T0794-D2      58.8     44.1     T0855-D1      13.0     34.8
a Performance (accuracy, %) of the initial 1.0 version of R2C.
b Performance (accuracy, %) of the updated 2.0 version of R2C.

1. Yang,J. et al. (2016) R2C: improving ab initio residue contact map prediction using dynamic
fusion strategy and Gaussian noise filter. Bioinformatics, 32, 2435-2443.
2. Morcos,F. et al. (2011) Direct-coupling analysis of residue coevolution captures native
contacts across many protein families. Proc. Natl. Acad. Sci. USA, 108, E1293-E1301.
3. Jones,D.T. et al. (2012) PSICOV: precise structural contact prediction using sparse inverse
covariance estimation on large multiple sequence alignments. Bioinformatics, 28, 184-190.
4. Kamisetty,H. et al. (2013) Assessing the utility of coevolution-based residue-residue contact
predictions in a sequence- and structure-rich era. Proc. Natl. Acad. Sci. USA, 110, 15674-
15679.
5. Remmert,M. et al. (2012) HHblits: lightning-fast iterative protein sequence searching by
HMM-HMM alignment. Nat. Methods, 9, 173-175.
6. Altschul,S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein
database search programs. Nucleic Acids Res., 25, 3389-3402.

SHORTLE

Energy-Based Refinement using a Genetic Algorithm with Multiple Selection Terms and
Survival Functions

D.R. Shortle
The Johns Hopkins University School of Medicine
dshortl1@jhmi.edu

Our fundamental strategy for structure prediction and refinement remains the same: 1) an
embellished genetic algorithm using 1x and 2x recombination events as the principal moves; 2) a
set of statistical potentials[1] that quantify side-chain and atom solvation[2], pair-wise
interactions[2], and φψ-energies[3] of one-, two-, and three-residue segments in turn/loop/irregular
regions; 3) multiple selection terms and survival functions that are adjusted during refinement to
give a concerted reduction of approximately 30 energy terms and heuristics. This step requires
extensive human intervention to adjust both selection terms for new child models and survival
functions for picking models for the next generation. 4) use of the CASP server models in the
first step as scaffolds for assembly/exchange with a library of fragments with low φψ-energies;
5) 3 to 8 refinement rounds (a round = 4-8 generations) are carried out, with models from a
previous generation undergoing recombination in the next generation, first with the original
fragment library and then with other models in the population ensemble (size 25).

Methods
Changes to our protocols in CASP12 include: 1) prior to their use as templates, sets of server
models underwent several generations of recombination, with only the children surviving. This
reduces the frequency of low energy server models persisting in the ensemble and out-competing
higher energy models. 2) A much more extensive series of trial fragment swaps with a library of
4000 fragments (3 to 50 in length) recovered from ~25,000 PDB structures. More than 80
independent runs, using starting populations of 25 models, were pooled and the best 10-25%
selected on the basis of criteria to maintain a balanced reduction in energy values. 3) Constant
maintenance of heavy selective pressure to remove atom overlap throughout the refinement
process, especially overlap involving residues separated by fewer than 6 residues. 4) An energy
term equivalent to the Lennard-Jones attraction energy was added to the statistical potential for
atom-atom interactions, along with several packing quality terms based on the modal distances of
atom pairs in high resolution structures[4].

Results
During the prediction season, we submitted predictions for 64 human targets and 39 refinement
targets. Targets with broken chains or domain insertions, and very long targets,
were not pursued. Our current programs were not able to deal with these problems in chain
integrity, and for the largest targets there was insufficient manpower to commit to them within
the allotted time.

Compared to CASP11, we were able to drive refinement of most target proteins to lower values
of energies/heuristics and structurally away from the better server or refinement models. As to
the accuracy of our submitted models, we will find out at the CASP meeting in December.

Availability
The programs and potentials used in CASP12 will be made available once our new methods and
statistical potentials have been published and funding secured for continuing this work.

1. Shortle, D. (2003). Propensities, probabilities, and the Boltzmann hypothesis. Protein Science
12, 1298-1302.
2. Fang, Q and Shortle, D. (2006). Protein refolding in silico with atom-based statistical
potentials and conformational search using a simple genetic algorithm. J. Mol. Biol. 359,
1456-1467.
3. Fang, Q. & Shortle, D. (2005). A consistent set of statistical potentials for quantifying local
side-chain and backbone interactions. Proteins 60, 90-96.
4. Li A.J. & Nussinov R. (1998) A set of van der Waals and coulombic radii of protein atoms in
molecular and solvent-accessible surface calculation, packing evaluation, and docking.
Proteins 32, 111-127.

Skwark

Convolutional neural networks for protein contact prediction from coevolution alone.

M.J. Skwark1,2, V. Golkov3 and J. Meiler1,2


1 Department of Chemistry, Vanderbilt University, 2 Center for Structural Biology, Vanderbilt University, 3
Department of Computer Science, Technical University Munich
marcin@skwark.pl

See myprotein-me, Skwark

Spiders

Evolutionary Approach to Template Free Ab-Initio Protein Structure Prediction Guided by
Statistical Energy Function

Avdesh Mishra, Sumaiya Iqbal, Md Tamjidul Hoque


Computer Science, University of New Orleans, 2000 Lakeshore Drive, New Orleans, LA 70148
(amishra2, siqbal1, thoque)@uno.edu

The goal of template free ab-initio protein structure prediction (PSP) is to predict the spatial
conformation of a protein from its amino acid sequence alone. Here, we used a genetic algorithm
(GA) for conformational sampling with a statistical energy function (3DIGARS3.0)1. The energy
function is an optimized linear accumulation of hydrophobic and hydrophilic properties,
sequence-specific accessibility and ubiquitous phi-psi angular properties. The GA uses phi, psi
and omega angles to model a protein structure. Dihedral angles obtained with SPINE-X2, SPIDER23,
ANGLOR4 and DSSP5 were used as seeds for the GA.
Three other packages, namely TINKER6, Scwrl47 and RMSD8, were integral parts of
our method. We participated in CASP129 to assess our methodology for PSP. None
of the CASP-hosted server predictions were used as a starting point for our method. The five best
models were obtained by clustering and submitted through the prediction submission
system of CASP129.

Methods
We developed a template free ab-initio PSP method using an evolutionary approach for
conformational sampling, guided by our statistical energy function
3DIGARS3.01. Our method accepts two types of input: i) a FASTA sequence and ii) a
PDB structure. Multiple backbone dihedral angle prediction methods (SPINE-X2, SPIDER23 and
ANGLOR4) were used to obtain the phi and psi angles from a given FASTA sequence. DSSP5 was
used to compute phi and psi angles from a given PDB structure. The PDB structures for a given
FASTA sequence were obtained from I-TASSER10. After seeding chromosomes (solutions) in the
GA with these multiple seeds, the remaining chromosomes were populated by creating variations
of the seeds through the mutation operation. For mutation, the phi-psi angle pair at the mutated
position was replaced by the best phi-psi angle pair obtained from the highly probable regions of
the Ramachandran plot11. We used a roulette wheel selection procedure to obtain the best phi-psi
angle pair from the Ramachandran plot11. In addition, we set the omega angle to 180 degrees for
all residues. This procedure initializes a GA population of size
200.
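The roulette wheel mutation described above can be sketched as follows. The Ramachandran bins and their relative weights in `RAMA_BINS` and `RAMA_WEIGHTS` are made-up placeholder values for illustration only; the actual method would use a probability table derived from the Ramachandran plot, which the abstract does not give. Chromosomes are assumed to be lists of (phi, psi, omega) tuples.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")
Residue = Tuple[float, float, float]  # (phi, psi, omega) in degrees

# Placeholder bins (phi, psi) and relative frequencies; illustrative values only.
RAMA_BINS: List[Tuple[float, float]] = [(-60.0, -45.0), (-120.0, 130.0), (60.0, 45.0)]
RAMA_WEIGHTS: List[float] = [0.5, 0.4, 0.1]


def roulette_select(items: Sequence[T], weights: Sequence[float], rng: random.Random) -> T:
    """Pick one item with probability proportional to its weight."""
    total = sum(weights)
    pick = rng.random() * total
    cumulative = 0.0
    for item, weight in zip(items, weights):
        cumulative += weight
        if pick < cumulative:
            return item
    return items[-1]  # guard against floating-point round-off


def mutate(
    chromosome: Sequence[Residue],
    bins: Sequence[Tuple[float, float]],
    weights: Sequence[float],
    rng: random.Random,
) -> List[Residue]:
    """Replace the phi-psi pair at one random position; omega stays at 180."""
    position = rng.randrange(len(chromosome))
    phi, psi = roulette_select(bins, weights, rng)
    child = list(chromosome)
    child[position] = (phi, psi, 180.0)
    return child
```

Populating the initial population then amounts to calling `mutate` repeatedly on the seed chromosomes until 200 chromosomes exist.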
Once the GA population was initialized, we computed the fitness (score) of each
chromosome using the 3DIGARS3.01 energy function. Applying the energy function requires the
structure in Cartesian coordinates, which we generated with the protein and xyzpdb subprograms
of TINKER6. We then applied the Scwrl47 program to reconstruct the side chains. The elite rate
was set to 5%.
The single point crossover operation was applied at a 90% rate. The remaining
chromosomes in the next population were randomly generated to keep the population size constant.
The single point mutation operation with a mutation rate of 50% was then applied to the new
population except the elites. A twin-removal12 operation was applied to remove chromosomes
that were more than 80% similar.
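One generation of the GA with the rates quoted above (5% elites, 90% crossover, 50% mutation, twin removal at 80% similarity) might look like the sketch below. Several details are assumptions because the abstract does not specify them: parents are drawn uniformly at random, similarity is the fraction of positions with identical phi-psi pairs, lower energy is treated as fitter, and the population is refilled with random chromosomes after twin removal. The `energy`, `random_chromosome` and `mutate` callbacks are supplied by the caller.

```python
import random
from typing import Callable, List, Sequence, Tuple

Chromosome = List[Tuple[float, float]]  # (phi, psi) per residue


def similarity(a: Sequence, b: Sequence, tol: float = 1e-6) -> float:
    """Fraction of positions whose phi and psi agree within tol (assumed measure)."""
    same = sum(
        1 for x, y in zip(a, b)
        if abs(x[0] - y[0]) <= tol and abs(x[1] - y[1]) <= tol
    )
    return same / len(a)


def remove_twins(population: Sequence[Chromosome], max_similarity: float = 0.8) -> List[Chromosome]:
    """Keep a chromosome only if it is not too similar to one already kept."""
    kept: List[Chromosome] = []
    for chrom in population:
        if all(similarity(chrom, other) <= max_similarity for other in kept):
            kept.append(chrom)
    return kept


def single_point_crossover(a: Chromosome, b: Chromosome, rng: random.Random):
    """Swap the tails of two parents at one random cut point."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:], b[:cut] + a[cut:]


def next_generation(
    population: Sequence[Chromosome],
    energy: Callable[[Chromosome], float],
    random_chromosome: Callable[[random.Random], Chromosome],
    mutate: Callable[[Chromosome, random.Random], Chromosome],
    rng: random.Random,
    elite_rate: float = 0.05,
    crossover_rate: float = 0.90,
    mutation_rate: float = 0.50,
) -> List[Chromosome]:
    size = len(population)
    ranked = sorted(population, key=energy)  # assumption: lower energy is fitter
    elites = [list(c) for c in ranked[: max(1, int(elite_rate * size))]]
    n_cross = int(crossover_rate * size)
    others: List[Chromosome] = []
    while len(others) < n_cross:
        parent1, parent2 = rng.sample(ranked, 2)  # assumed uniform parent choice
        others.extend(single_point_crossover(list(parent1), list(parent2), rng))
    others = others[:n_cross]
    # Fill the rest of the population with random chromosomes.
    while len(elites) + len(others) < size:
        others.append(random_chromosome(rng))
    # Mutate everything except the elites.
    others = [mutate(c, rng) if rng.random() < mutation_rate else c for c in others]
    new_population = remove_twins(elites + others)
    # Assumption: refill after twin removal to restore the population size.
    while len(new_population) < size:
        new_population.append(random_chromosome(rng))
    return new_population
```

Because elites are placed first and twin removal keeps the first occurrence, the best chromosome of a generation always survives into the next one in this sketch.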
After collecting the structures from all the available generations, we grouped the
structures into five top-performing clusters that are at least 5 Å apart from each other based on
the root-mean-square deviation (RMSD)8. Top solutions from each of these five clusters were
submitted to CASP12.
We estimated the errors in our predicted structures and reported them in place of the
temperature factor. To quantify the mobility of individual atoms across the 5 models, we
computed the Euclidean distance (d) of each atom (a) from all other atoms in a specific model,
a_d1, a_d2, …, a_d5. Then, we calculated the mean of these distances for a specific atom, a_mean(d), and
took the displacements from this mean, a_mean(d)-d1, a_mean(d)-d2, …, a_mean(d)-d5, as the flexibilities in
the predicted position of that atom in the 5 models. Further, we predicted the disorder
probabilities of the residues from sequence alone using our recently developed protein disorder
prediction tool, DisPredict213. To capture the intrinsic flexibilities of the atoms in the
quantification of the possible error in their structure prediction, we multiplied the corresponding
residue disorder probabilities by the distances computed as described above to
generate the final error value.
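The error estimate described above leaves some room for interpretation, so the sketch below spells out one possible reading. It assumes that the per-model distance of an atom is its mean Euclidean distance to all other atoms in that model, that the displacement from the cross-model mean is taken as an absolute value (so that the value can stand in for a temperature factor), and that each atom inherits the disorder probability of its residue. All names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Coord = Tuple[float, float, float]


def atom_distance_profile(coords: Sequence[Coord]) -> List[float]:
    """Mean Euclidean distance of each atom to all other atoms in one model."""
    n = len(coords)
    profile = []
    for i in range(n):
        total = sum(math.dist(coords[i], coords[j]) for j in range(n) if j != i)
        profile.append(total / (n - 1))
    return profile


def error_estimates(
    models: Sequence[Sequence[Coord]],
    disorder_prob: Sequence[float],
    atom_to_residue: Sequence[int],
) -> List[List[float]]:
    """Per-model, per-atom error: disorder probability times |mean(d) - d_m|."""
    profiles = [atom_distance_profile(m) for m in models]
    n_atoms = len(models[0])
    # Mean of the per-model distances for every atom.
    mean_d = [sum(p[a] for p in profiles) / len(profiles) for a in range(n_atoms)]
    errors = []
    for profile in profiles:
        errors.append([
            disorder_prob[atom_to_residue[a]] * abs(mean_d[a] - profile[a])
            for a in range(n_atoms)
        ])
    return errors


# Toy usage: two models of a two-atom, one-residue "protein".
toy_models = [
    [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0)],
]
toy_errors = error_estimates(toy_models, disorder_prob=[0.5], atom_to_residue=[0, 0])
```

In the toy case every atom deviates by 1 Å from the cross-model mean, which the disorder probability of 0.5 scales down to an error value of 0.5.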

Availability
The ab-initio PSP software is available at
http://cs.uno.edu/~tamjid/Software/ab_initio/v1/PSP.zip

1. Mishra, A., Iqbal, S. & Hoque, M. T. (2016). Discriminate protein decoys from native by
using a scoring function based on ubiquitous Phi and Psi angles computed for all atom.
Journal of theoretical biology 398, 112-121.
2. Zhang, T., Faraggi, E. & Zhou, Y. (2010). Fluctuations of backbone torsion angles obtained
from NMR-determined structures and their prediction. Proteins: Structure, Function, and
Bioinformatics 78, 3353-3362.
3. Heffernan, R., Paliwal, K., Lyons, J., Dehzangi, A., Sharma, A., Wang, J., Sattar, A., Yang,
Y. & Zhou, Y. (2015). Improving prediction of secondary structure, local backbone angles,
and solvent accessible surface area of proteins by iterative deep learning. Scientific Reports
5.
4. Wu, S. & Zhang, Y. (2008). ANGLOR: A Composite Machine-Learning Algorithm for
Protein Backbone Torsion Angle Prediction. PLOS ONE 3 (10).
5. Kabsch, W. & Sander, C. (1983). Dictionary of protein secondary structure: pattern
recognition of hydrogen-bonded and geometrical features. Biopolymers 22, 2577-2637.
6. Ponder, J. W. & Richards, F. M. (1987). An efficient newton-like method for molecular
mechanics energy minimization of large molecules. Journal of Computational Chemistry 8,
1016-1024.
7. Krivov, G. G., Shapovalov, M. V. & Dunbrack, R. L. Jr. (2009). Improved prediction of protein
side-chain conformations with SCWRL4. Proteins: Structure, Function, and Bioinformatics
77, 778-795.
8. Lab, Z. RMSD Software, Vol. July 2014.
9. CASP12, 2016.
10. Yang, J., Yan, R., Roy, A., Xu, D., Poisson, J. & Zhang, Y. (2015). The I-TASSER Suite:
Protein structure and function prediction. Nature Methods 12, 7-8.
11. Ramachandran, G. N., Ramakrishnan, C. & Sasisekharan, V. (1963).
Stereochemistry of polypeptide chain configurations. Journal of Molecular Biology 7, 95-99.
12. Hoque, M. T., Chetty, M., Lewis, A. & Sattar, A. (2011). Twin Removal in Genetic
Algorithms for Protein Structure Prediction using Low Resolution Model. IEEE/ACM
Transactions on Computational Biology and Bioinformatics (TCBB).
13. Iqbal, S. & Hoque, M. T. (2016). Estimation of Position Specific Energy as a Feature of
Protein Residues from Sequence Alone for Structural Classification. PLoS ONE 11,
e0161452.

SSThread

Template-free protein structure prediction using SSThread

K.J. Maurice
kevin_maurice@hotmail.com

The SSThread1 algorithm was used to make tertiary structure predictions of individual domains.
SSThread first predicts the structure of pairs of α-helices and β-strands (secondary structure
elements [SSEs]) that are derived from experimental structures using a knowledge-based
potential (KBP), secondary structure prediction and contact map prediction. Then overlapping
pair predictions are assembled to generate an ensemble of core structure predictions. Loops and
side chains are then predicted.

Methods
Domain boundaries were predicted using SVMBounder (unpublished). SVMBounder uses
template alignments, evolutionary signals and sequence information, all of which are inputs into
Support Vector Machines.
SSThread uses a database that was created by clustering the structures of contacting SSE
pairs taken from a set of experimental structures. For each target, the structures of many SSE
pairs are predicted from the database and the target sequence. Then a non-stochastic greedy
search is conducted in which structures containing an increasing number of SSEs are generated
by merging two smaller structures that can be joined by an overlapping pair prediction, resulting
in a set of core structure predictions. To predict the loops, high-scoring segments from a set of
experimental structures that have end orientations similar to those of the gaps are selected. The
loop predictions are then closed using Cyclic Coordinate Descent2.
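The greedy assembly of overlapping SSE pairs can be pictured with a topology-only sketch. It ignores all geometry, grows a structure by one SSE at a time by joining it with a pair prediction that shares exactly one SSE with it, adds scores, and keeps a fixed number of best structures per size. The additive score, the `beam` width and the data layout are assumptions for illustration and are not taken from the SSThread description.

```python
from typing import Dict, FrozenSet, Tuple


def greedy_assemble(
    pair_scores: Dict[Tuple[int, int], float],
    n_sse: int,
    beam: int = 10,  # hypothetical number of structures kept per size
) -> Dict[FrozenSet[int], float]:
    """Deterministically grow SSE sets from scored pair predictions.

    pair_scores maps SSE index pairs (i, j) to a score (higher is better).
    Returns the best score found for every assembled set of SSEs.
    """
    # Every predicted pair is a starting structure with two SSEs.
    level: Dict[FrozenSet[int], float] = {
        frozenset(pair): score for pair, score in pair_scores.items()
    }
    best = dict(level)
    for _size in range(3, n_sse + 1):
        candidates: Dict[FrozenSet[int], float] = {}
        for structure, s_score in level.items():
            for (i, j), p_score in pair_scores.items():
                # The pair overlaps the structure if exactly one SSE is shared.
                if (i in structure) != (j in structure):
                    merged = structure | {i, j}
                    score = s_score + p_score
                    if score > candidates.get(merged, float("-inf")):
                        candidates[merged] = score
        if not candidates:
            break
        # Deterministic ranking: by score, then by SSE indices to break ties.
        ranked = sorted(candidates.items(), key=lambda kv: (-kv[1], sorted(kv[0])))
        level = dict(ranked[:beam])
        best.update(level)
    return best


# Toy usage with four SSEs.
toy_pairs = {(0, 1): 2.0, (1, 2): 1.5, (0, 2): 0.5, (2, 3): 1.0}
toy_result = greedy_assemble(toy_pairs, n_sse=4)
```

Because there is no random element, repeated runs on the same pair predictions return the same set of core structures, which matches the non-stochastic character of the search described above.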
The KBP used for scoring includes an orientation dependent residue to residue contact
term, a backbone torsion angle term, a solvent accessibility term, a compactness term and an SSE
length term. The KBP uses homologous sequences identified by searching the UniProt database
with HHblits3. The secondary structure is predicted by PSIPRED4. Contact maps
are predicted by MetaPSICOV5. During SSE pair prediction, an additional score is used to
ensure that the distribution of predictions among the residues of the target accurately reflects the
secondary structure prediction.
All-atom predictions were generated from the ensemble of backbone predictions by first
predicting side chain conformations using SIDEpro6. Then a brief energy minimization was
carried out by GROMACS7 using the AMBER038 forcefield with implicit solvent. The
predictions were clustered by RMSD so that the submitted predictions are non-redundant. The
all-atom KBP dDFIRE9 was used to select the top 10 predictions. The top 10 predictions were
then refined using GalaxyRefine10. The 5 predictions submitted were selected from the top 10
by manual inspection, preferring native-like structures.

Availability

SSThread and SVMBounder are available as stand-alone programs free for non-commercial use
at www.kjmaurice.com/downloads.html.

1. Maurice,K.J. (2014) SSThread: Template-free protein structure prediction by threading pairs
of contacting secondary structures followed by assembly of overlapping pairs. J. Comput.
Chem. 35, 644-656.
2. Canutescu,A.A., Dunbrack,R.L. Jr. (2003) Cyclic coordinate descent: A robotics algorithm
for protein loop closure. Protein Sci. 12, 963-972.
3. Remmert,M., Biegert,A., Hauser,A., Söding,J. (2011) HHblits: lightning-fast iterative
protein sequence searching by HMM-HMM alignment. Nat. Methods 9, 173-175.
4. Jones,D.T. (1999) Protein secondary structure prediction based on position-specific scoring
matrices. J. Mol. Biol. 292, 195-202.
5. Jones,D.T., Singh,T., Kosciolek,T., Tetchner,S. (2015) MetaPSICOV: combining coevolution
methods for accurate prediction of contacts and long range hydrogen bonding in proteins.
Bioinformatics 31, 999-1006.
6. Nagata,K., Randall,A., Baldi,P. (2012) SIDEpro: a novel machine learning approach for the
fast and accurate prediction of side-chain conformations. Proteins 80, 142-153.
7. Berendsen,H.J.C., van der Spoel,D., van Drunen,R. (1995) GROMACS: a message-passing
parallel molecular dynamics implementation. Comp. Phys. Comm. 91, 43-56.
8. Duan,Y., Wu,C., Chowdhury,S., Lee,M.C., Xiong,G., Zhang,W., Yang,R., Cieplak,P., Luo,R.,
Lee,T., Caldwell,J., Wang,J., Kollman,P. (2003) A point-charge force field for molecular
mechanics simulations of proteins based on condensed-phase quantum mechanical
calculations. J. Comp. Chem. 24, 1999–2012.
9. Yang,Y., Zhou,Y. (2008) Ab initio folding of terminal segments with secondary structures
reveals the fine difference between two closely related all-atom statistical energy functions.
Protein Sci. 17, 1212-1219.
10. Heo,L., Park,H., Seok,C. (2013) GalaxyRefine: Protein structure refinement driven by side-
chain repacking. Nucleic Acids Res. 41, W384-W388.

SUMMA_pmf

KB_0.1_v2.0: Refinement Using an MM/PMF Hybrid with a Modified Disulfide Energy
Profile

Aaron Maus and Christopher M. Summa


Dept. of Computer Science, University of New Orleans, LA
amaus@uno.edu, csumma@uno.edu

We perform protein structure refinement via potential energy minimization (PEM) using a hybrid
molecular mechanics and potential of mean force (PMF)1 energy function, which we call
KB_0.1_v2.0, based on the previously published KB_0.1 potential2. The KB_0.1 potential and
its variants have met with some success in refinement tasks3. In KB_0.1_v2.0, the PMF
component, representing the non-bonded portion of the potential, has been updated overall and
modified to handle disulfide bridges.

Methods
After homology modelling, disulfide bonds have usually not been explicitly formed. Rather, the
cysteines are in the same vicinity but remain reduced. As disulfide bonds serve as anchors
in conformational space, forming those bonds is a potential improvement to our refinement
process. Deciding which cysteines should be bonded is an open and difficult
problem. We assume in this work that if, after homology modelling, cysteine sulfurs are close
(< ~4.5 Å), then they should be bonded. This assumption does not always hold: there are proteins
with non-bonded cysteines within this range, but they are a minority of cases. Using the
original version of our hybrid potential, KB_0.1, minimization will not pull two sulfurs together
to form an oxidized disulfide. The reason is that the energies derived for atom pair interactions
are based on the number of interactions observed in given distance bins. The bond length of a
disulfide bond is 2.1 Å, while the van der Waals radius of sulfur is 1.8 Å. As a result, there are
almost no sulfur pairs in the distance range 2.2 Å to 3.5 Å. This results in an energy hill over this
range and prevents sulfurs that are close from being pulled into the strong energy well of the
disulfide bridge. As an experiment, we have hand-modified the energy functions for the CSG-CSG atom
pair and the CSG-CCB atom pair to smooth out the low-distance energy hills and allow the
refinement process to form disulfide bridges. We test whether refinement
using the modified PMF results in better structures for proteins with disulfide bridges than
refinement using the non-modified PMF.
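
The origin of the energy hill can be illustrated with a minimal Python sketch. It assumes that the pair energy in each distance bin follows the common inverse-Boltzmann form E = -kT ln(N_obs / N_exp); the abstract only states that the energies are based on bin counts, so this formula, the pseudocount, the toy counts and the function names are all hypothetical, and the linear smoothing is a stand-in for the hand modification rather than the authors' actual procedure.

```python
import math

KT = 0.593  # kcal/mol near 298 K (assumed temperature)

def pmf_from_counts(observed, expected, pseudocount=0.01):
    """Inverse-Boltzmann energy per distance bin: E = -kT * ln(obs / exp).

    The small pseudocount (an assumption) keeps empty bins finite.
    """
    return [-KT * math.log((o + pseudocount) / (e + pseudocount))
            for o, e in zip(observed, expected)]

def flatten_hill(energies, start, stop):
    """Linearly interpolate the energies between bins start and stop.

    This mimics hand-smoothing a barrier so that minimization can cross it.
    """
    smoothed = list(energies)
    span = stop - start
    for k in range(start, stop + 1):
        t = (k - start) / span
        smoothed[k] = (1 - t) * energies[start] + t * energies[stop]
    return smoothed

# Toy S-S counts for bins centred at 2.0, 2.5, 3.0, 3.5 and 4.0 Angstrom:
# many pairs at the bond length, almost none between 2.2 and 3.5 Angstrom.
observed = [500, 0, 0, 1, 40]
expected = [20, 25, 30, 35, 40]

raw = pmf_from_counts(observed, expected)  # empty bins give a large positive energy
smooth = flatten_hill(raw, 0, 4)           # monotonic path into the disulfide well
```

In the raw profile the nearly empty bins form a barrier between the 4 Å bin and the well at the bond length, so a gradient-based minimizer started at 4 Å moves away from the well; in the smoothed profile the path is downhill throughout.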

1. Lu,H. & Skolnick,J. (2001). A Distance-Dependent Atomic Knowledge-Based Potential for
Improved Protein Structure Selection. Proteins. 44, 223-232.
2. Summa,C. M. & Levitt,M. (2007). Near-native structure refinement using in vacuo energy
minimization. Proc Natl Acad Sci USA. 104, 3177-3182.
3. Rodrigues,J. P. G. L. M., Levitt,M. & Chopra,G. (2012). KoBaMIN: a knowledge-based
minimization web server for protein structure refinement. Nucleic Acids Res. 40, W323-
W328.

SUN_Tsinghua

All-Atom Conditioned Self-Avoiding Walk: An Ab Initio Protein Folding Method

Weitao Sun
Zhou Pei-Yuan Center for Applied Mathematics, Tsinghua University, Beijing, 100084, China
sunwt@tsinghua.edu.cn

All-Atom Conditioned Self-Avoiding Walk (AA-CSAW) is an ab initio protein folding
simulation model based on the Monte Carlo (MC) method2; 3; 6. The polypeptide chain is simulated
as effectively rigid crank units linked by covalent bonds. Bond lengths and bond angles are fixed
at their optimal values. The structure of the polypeptide is fully described by the backbone
dihedral angles and the sidechain dihedral angles. The number of sidechain dihedral angles depends
on the type of amino acid. A trial structure is randomly generated by pivoting the polypeptide
chain and sidechains. In the pivot algorithm, the backbone dihedral angles for each residue are
chosen in the Ramachandran plot according to a probability distribution derived from a 3-residue
fragment set. The effective energy of a protein structure is constructed by considering the
hydrophobic effect, the desolvation effect and hydrogen bonding interactions. A trial
three-dimensional structure is accepted with a probability given by the Metropolis scheme5. In
order to evaluate the accepted structures in MC simulations, the ratio of secondary structure
content to radius of gyration is introduced.
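
The Metropolis acceptance step can be sketched in a few lines of Python. This is a generic illustration, not the AA-CSAW code: the one-dimensional toy energy, the proposal function, the value of kT and all function names are assumptions made for the example.

```python
import math
import random

def metropolis_accept(e_old, e_new, kT, rng=random):
    """Return True if a trial structure should be accepted.

    Downhill moves are always accepted; uphill moves are accepted
    with probability exp(-(e_new - e_old) / kT).
    """
    delta = e_new - e_old
    if delta <= 0:
        return True
    return rng.random() < math.exp(-delta / kT)

def run_mc(energy, propose, state, kT, n_steps, seed=0):
    """Generic MC loop; `energy` and `propose` are hypothetical callables."""
    rng = random.Random(seed)
    e = energy(state)
    for _ in range(n_steps):
        trial = propose(state, rng)
        e_trial = energy(trial)
        if metropolis_accept(e, e_trial, kT, rng):
            state, e = trial, e_trial
    return state, e

# Toy usage: a single dihedral angle (degrees) in a quadratic well at -60.
def toy_energy(phi):
    return 0.01 * (phi + 60.0) ** 2

def toy_pivot(phi, rng):
    return phi + rng.uniform(-10.0, 10.0)

final_phi, final_e = run_mc(toy_energy, toy_pivot, 120.0, kT=0.6, n_steps=5000)
```

In the method described above, the proposal would draw backbone dihedral angles from the fragment-derived Ramachandran distribution instead of the uniform perturbation used here.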
We do not use other methods as an integral part of our method. AA-CSAW is used for all
predictions. AA-CSAW does not use templates or CASP-hosted server predictions as a
starting point. We do not use any methods for ranking predictions.

Methods

Observations of Protein Data Bank1 data show that the distribution of $(\phi_i, \psi_i)$ in the
Ramachandran plot is far from uniform. It seems that the dihedral angle values of residue $i$ are
clearly related to the amino acid types of residues $i-1$ and $i+1$. We constructed dihedral angle
distribution models for all 20 amino acids based on a high-resolution 3-residue fragment
database. This prior information substantially improves the accuracy and convergence of the
AA-CSAW method.
Since the crank model provides the locations of the backbone atoms, the central problem is how to
determine the sidechain atom coordinates when the atom coordinates are known for a backbone
structure in an arbitrary orientation. Thanks to the knowledge of amino acid structures, we have
the atom coordinates of each sidechain in some reference orientation. As a consequence, we can
determine the sidechain atom coordinates by matching the amino acid to the backbone of a crank.
As the structures of the 20 amino acids are well determined by experimental observation, we have
the atom coordinates for any type of residue, including backbone $X^{obs}_{BB}$ and sidechain
$X^{obs}_{SC}$. The only problem is that the observed amino acid structures are usually not in the
same orientation as in the crank model. If the backbone part of the observed amino acid structure
overlaps with the crank model, i.e., $X^{obs}_{BB} = X^{crank}_{BB}$, it is obvious that the crank
sidechain atoms will be determined by $X^{crank}_{SC} = X^{obs}_{SC}$. By multiplying by a rotation
matrix $M_{obs \to crank}$, the observed amino acid structure can easily be made to overlap the
crank.
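
One simple way to obtain such a rotation is to build an orthonormal frame from the N, CA and C atoms of both backbones and to express the sidechain atoms in that frame. The Python sketch below does this; it is an illustration under the assumption that three backbone atoms are matched exactly, and the function names and toy coordinates are hypothetical rather than taken from AA-CSAW.

```python
import math

def sub(a, b):
    return [a[k] - b[k] for k in range(3)]

def dot(a, b):
    return sum(a[k] * b[k] for k in range(3))

def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def unit(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def backbone_frame(n, ca, c):
    """Right-handed orthonormal frame attached to the N, CA, C atoms."""
    e1 = unit(sub(c, ca))
    e3 = unit(cross(e1, sub(n, ca)))
    e2 = cross(e3, e1)
    return [e1, e2, e3]

def place_sidechain(obs_bb, obs_sc, crank_bb):
    """Map observed sidechain atoms onto the crank backbone.

    obs_bb and crank_bb are (N, CA, C) coordinate triples. The implied
    rotation matrix is F_crank^T * F_obs, applied about the CA atom.
    """
    f_obs = backbone_frame(*obs_bb)
    f_crank = backbone_frame(*crank_bb)
    ca_obs, ca_crank = obs_bb[1], crank_bb[1]
    placed = []
    for atom in obs_sc:
        # Coordinates of the atom in the observed backbone frame.
        local = [dot(sub(atom, ca_obs), e) for e in f_obs]
        # The same local coordinates expressed in the crank frame.
        placed.append([ca_crank[k] + sum(local[m] * f_crank[m][k] for m in range(3))
                       for k in range(3)])
    return placed

# Toy check: the crank backbone is the observed one rotated 90 degrees
# about z and shifted, so the placed CB must follow the same transform.
def rot_z90_shift(p):
    return [-p[1] + 1.0, p[0] + 2.0, p[2] + 3.0]

obs_bb = ([-0.53, 1.36, 0.0], [0.0, 0.0, 0.0], [1.52, 0.0, 0.0])
obs_sc = [[-0.52, -0.77, -1.21]]
crank_bb = tuple(rot_z90_shift(p) for p in obs_bb)
placed = place_sidechain(obs_bb, obs_sc, crank_bb)
```

When the two backbones differ slightly in bond lengths or angles, a least-squares superposition would be the more usual choice; the frame construction above is exact only for identical backbone geometry.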
The effective structure energy is composed of three parts: the hydrophobic effect, hydrogen
bonding and the desolvation energy. The hydrophobic energy is estimated based on two factors: the
solvent accessible surface area (SASA) and the residue types. If residue i has more neighbors,
it is more buried in the protein and has less SASA. In addition, if the surrounding residues are
all hydrophobic, the hydrophobic energy of residue i is high. A pair of residues i, j is
considered to be in contact if any two non-hydrogen side chain atoms (NHSA), one from each
residue, are within a specified cutoff distance. In AA-CSAW, we use the Atom Distance Criterion
(ADC) model7; 8 to determine residue contacts. We introduce a scheme that decreases the hydrophobic
energy when an aggregate of hydrophobic residues grows to a large size. This provides
more chances to open the hydrophobic core, which is essential for escaping misfolded intermediate
structures.
In AA-CSAW, the DSSP4 method is used as the hydrogen bond (HB) criterion. The total number of
hydrogen bonds is a measure of the HB energy. Since the stability of a hydrogen bond may depend on
its location, an optimal HB strength parameter is used as a weight.
A collapsed chain with a hydrophobic core but without hydrogen bonds is usually in a high
free energy state. In order to prevent the formation of a tight hydrophobic core without hydrogen
bonding, we introduce a penalty for buried NH and CO groups that cannot form hydrogen bonds.
AA-CSAW is now a parallel code and can produce many candidate structures. We
find that the ratio of secondary structure content to radius of gyration is a good indicator
for evaluating a structure. This value usually depends on the length of the protein. For the same
protein, the higher this ratio, the better the predicted structure.
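
The ranking indicator can be sketched as follows. The abstract does not define how secondary structure content is counted, so this example assumes it is the fraction of residues labelled helix (H) or strand (E) in a DSSP-style string, and it uses an unweighted radius of gyration; the function names and toy data are hypothetical.

```python
import math

def radius_of_gyration(coords):
    """Unweighted radius of gyration of a list of (x, y, z) points."""
    n = len(coords)
    center = [sum(p[k] for p in coords) / n for k in range(3)]
    msd = sum(sum((p[k] - center[k]) ** 2 for k in range(3)) for p in coords) / n
    return math.sqrt(msd)

def ss_content(ss_string):
    """Fraction of residues in helix (H) or strand (E); an assumed definition."""
    return sum(ch in "HE" for ch in ss_string) / len(ss_string)

def ss_rg_ratio(ss_string, coords):
    return ss_content(ss_string) / radius_of_gyration(coords)

def rank_candidates(candidates):
    """Sort (name, ss_string, coords) tuples by decreasing ratio."""
    return sorted(candidates, key=lambda c: ss_rg_ratio(c[1], c[2]), reverse=True)

# Toy candidates for the same four-residue chain.
compact = ("compact", "HHHH", [(0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 1, 0)])
extended = ("extended", "HH--", [(0, 0, 0), (3, 0, 0), (6, 0, 0), (9, 0, 0)])
ranking = [c[0] for c in rank_candidates([extended, compact])]
```

Because the ratio depends on chain length, it is only meaningful for comparing candidate structures of the same protein, as the text notes.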

Results

All results, intermediate data files, and performance analysis documents will soon be available
on the web at http://zcam.tsinghua.edu.cn/~sunwt/aacsaw.htm.

Availability

AA-CSAW version 1.0.0 is written in C++ and has been compiled and tested on both
Windows XP and Linux systems. The software is available on request by email to
sunwt@tsinghua.edu.

1. Berman, H.M., et al. (2000) The Protein Data Bank, Nucleic Acids Res, 28, 235-242.
2. Huang, K. (2007) Conditioned self-avoiding walk (CSAW): stochastic
approach to protein folding, Biophysical Reviews and Letters 2, 139-154.
3. Huang, K. (2008) Protein folding as a physical stochastic process,
Biophysical Reviews and Letters 3, 1-18.
4. Kabsch, W. and Sander, C. (1983) Dictionary of Protein Secondary Structure - Pattern-
Recognition of Hydrogen-Bonded and Geometrical Features, Biopolymers, 22, 2577-2637.
5. Metropolis, N. (1987) The Beginning of Monte Carlo Method, Los Alamos Science, 15, 125–
130.

6. Sun, W. (2007) Protein folding simulation by all-atom CSAW method. 2007 IEEE International
Conference on Bioinformatics and Biomedicine Workshops, Proceedings.
7. Sun, W. and He, J. (2010) Understanding on the Residue Contact Network Using the Log-
Normal Cluster Model and the Multilevel Wheel Diagram, Biopolymers, 93, 904-916.
8. Sun, W.T. and He, J. (2011) From Isotropic to Anisotropic Side Chain Representations:
Comparison of Three Models for Residue Contact Estimation, PLoS ONE, 6.

TASSER

TASSER in CASP12

Hongyi Zhou, Jeffrey Skolnick


Georgia Institute of Technology

In CASP12, we tested a new protocol for the human expert prediction group TASSER.

For a given target, we first run the SP3 threading method to determine whether the target is
Hard, defined as an SP3 threading Z-score < 4.5. If the target is Hard and has fewer than 200
residues, we generate around 50,000 models using a fragment-based ab initio approach
implemented in our lab, similar to the one in the chunk-TASSER method but folding full-length
models instead of chunk models. Subsequently, three model selection methods are each used to
select 20 models. The model selection methods are DFIRE, GOAP and FTCOM,
developed in our group. A 3D-jury approach is then applied to the total of 60 models to select
the final best model for submission as model 1. Models 2, 3 and 4 are models refined by TASSER
from the 20 models selected by each selection method, respectively. Model 5 is selected from all
CASP server models by 3D-jury. For non-Hard targets or targets with more than 200 residues, we
use only the CASP server models and apply the same model selection and refinement methods as
described above for ab initio models. We then add model 1 of TASSER-VMT (which did not
participate in CASP12) as the fifth model of the final submission.
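
The consensus idea behind a 3D-jury style selection can be sketched in Python as follows. The abstract does not give the similarity measure or any threshold, so the example assumes a precomputed symmetric matrix of pairwise structural similarities in the range 0 to 1 and an optional cutoff; the model names, matrix values and function names are hypothetical.

```python
def jury_scores(similarity, threshold=0.0):
    """Average similarity of each model to all other models.

    Pairs below `threshold` contribute nothing (an assumed convention).
    """
    n = len(similarity)
    scores = []
    for i in range(n):
        total = sum(similarity[i][j] for j in range(n)
                    if j != i and similarity[i][j] >= threshold)
        scores.append(total / (n - 1))
    return scores

def select_by_jury(names, similarity, threshold=0.0):
    """Return the name and score of the model with the highest consensus."""
    scores = jury_scores(similarity, threshold)
    best = max(range(len(names)), key=scores.__getitem__)
    return names[best], scores[best]

# Toy pool: three mutually similar models and one outlier.
names = ["dfire_1", "goap_1", "ftcom_1", "outlier"]
similarity = [
    [1.00, 0.80, 0.70, 0.20],
    [0.80, 1.00, 0.75, 0.10],
    [0.70, 0.75, 1.00, 0.15],
    [0.20, 0.10, 0.15, 1.00],
]
best_name, best_score = select_by_jury(names, similarity)
```

The selected model is the one that agrees best with the rest of the pool, which is why pooling the selections of several scoring functions before the jury step can help when any single function picks an outlier.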

TSlab-assembly

Prediction of Oligomeric Protein Structures based on Template-Based Modeling

Yasuomi Kiyota, Yudai Yamamoto, Katsuya Naito and Mayuko Takeda-Shitaka


School of Pharmacy, Kitasato University, Tokyo, Japan
shitakam@pharm.kitasato-u.ac.jp

We participated in the assembly category of CASP12. We predicted both homo- and hetero-
oligomeric protein structures according to the oligomeric state in the CASP12 target list. Our
modeling procedure was based on a template-based modeling method.

Methods
Template search
We mainly used HHblits1,2 for template search. In order to find appropriate templates for as
many oligomer targets as possible, we constructed an original template database for oligomer
prediction. In CASP12, HHblits was run on this database. Symmetry operations were carried out
to generate oligomeric template structures when necessary. For template selection and alignment,
we also used the information of proteins obtained by PSI-BLAST3 and RPS-BLAST4.

3D-structure construction
We constructed 3D homo- and hetero- oligomeric structures according to the given oligomeric
state in the target list. For all oligomer targets, we constructed oligomeric structures based on the
oligomeric template structures using MODELLER5.
Moreover, for hard targets, we used CASP12 server models (monomer models). The server
monomer models were superimposed onto the oligomer templates using the interfacial residues
of template. Refinement was carried out to remove steric clashes between the interface residues.

1. Remmert,M., Biegert,A., Hauser,A., Söding,J. (2011) HHblits: Lightning-fast iterative
protein sequence searching by HMM-HMM alignment. Nat Methods. 9(2), 173-175.
2. Söding, J., Biegert, A. & Lupas, A. N. (2005) The HHpred interactive server for protein
homology detection and structure prediction. Nucleic Acids Res. 33, W244-248.
3. Altschul,S.F., Madden,T.L., Schaffer,A.A., Zhang,J., Zhang,Z., Miller,W. & Lipman,D.J.
(1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search
programs. Nucleic Acids Res. 25, 3389-3402.
4. Marchler-Bauer A, Panchenko AR, Shoemaker BA, Thiessen PA, Geer LY, Bryant SH.
(2002) CDD: a database of conserved domain alignments with links to domain three-
dimensional structure. Nucleic Acids Res. 30, 281–283.
5. A. Sali and T.L. Blundell. (1993). Comparative protein modelling by satisfaction of spatial
restraints. J. Mol. Biol. 234, 779-815.

TSSPRED2

TSPPRED: A composite approach for tertiary structure prediction of proteins

Harinder Singh, Sandeep Singh and Gajendra P.S. Raghava*


Bioinformatics Centre, CSIR-Institute of Microbial Technology, Chandigarh (160036), India
*raghava@imtech.res.in

There are no changes in the algorithm of TSPPRED, which participated in the previous CASP11
experiment. However, the protein database of the HH-suite package1 was updated to the latest
release before the start of the CASP12 experiment. We also updated the in-house library of
representative dihedral angles. This library is used to predict the dihedral angles of the input
protein sequence, which helps in generating the structure of the protein.

1. Remmert, M., Biegert, A., Hauser, A. & Söding, J. HHblits: lightning-fast iterative protein
sequence searching by HMM-HMM alignment. Nat Methods 9, 173-175, doi:10.1038/nmeth.1818
(2012).

UNRES (TS)

Use of UNRES force field and replica-exchange molecular dynamics in physics-based
template-free prediction of protein structures

P. Krupa1, A.S. Karczyńska2, R. Ganzynkowicz2, M.D. Wiśniewska3, M.A. Mozolewska1,
M. Głuski2, Ł. Golon2, Y. He1, A.K. Sieradzan2, Y. Yin1, R. Ślusarz2, M. Ślusarz2, A. Giełdoń2,
C. Czaplewski2, H.A. Scheraga1 and A. Liwo2*
1 - Baker Laboratory of Chemistry and Chemical Biology, Cornell University, Ithaca, NY 14853-1301, U.S.A., 2 -
Faculty of Chemistry, University of Gdańsk, Wita Stwosza 63, 80-308 Gdańsk, Poland, 3 - Centre of New
Technologies, University of Warsaw, Banacha 2c Str., 02-097 Warsaw, Poland
adam.liwo@ug.edu.pl

Physics-based approaches are, so far, less efficient than knowledge-based approaches in the
prediction of protein structures; however, their advantage is the independence of structural
databases. In the physics-based approaches, the predicted structure is sought out using the lowest
potential energy or as a mean structure of the family of conformations that has the lowest free
energy at room temperature; the latter approach follows the physics of protein folding more
closely. Use of all-atom force fields is impractical in such de novo simulations of protein
structures due to excessive time and huge computer resources required. On the other hand, de
novo simulations of even large proteins are feasible with coarse-grained force fields that use
highly reduced representations of polypeptide chains.
In the last several years, we have been developing the united-residue
(UNRES) force field for physics-based prediction of protein structures and large-scale
simulations of protein folding, together with a variety of methods for searching the
conformational space1. Recently we introduced various improvements in UNRES. The resulting
force field has been tested in the present CASP experiment.

Methods
In the UNRES model1, a polypeptide chain is represented by a sequence of alpha-carbon atoms
connected by virtual bonds with attached side chains. Two interaction sites are used to represent
each amino acid: the united peptide group (p) located in the middle of two consecutive alpha-
carbon atoms and the united side chain (SC). The interactions of this simplified model are
described by the UNRES potential derived from the generalized cluster-cumulant expansion of a
restricted free energy (RFE) function of polypeptide chains. The cumulant expansion enabled us
to determine the functional forms of the multibody terms in UNRES. The effective energy
function depends on temperature2. The force field used in the current CASP experiment was
optimized by using the maximum-likelihood method that we developed recently3, in which
optimization is carried out to achieve the best agreement between the calculated conformational
ensembles and those determined by NMR at various temperatures, spanning temperatures below the
melting point, the melting point itself and temperatures above the melting point. Seven small
proteins (with chain lengths below 60 amino-acid residues; two α-helical, two β-, and three α+β-
proteins) were used to carry out the optimization. The force field also includes the recently
introduced additional torsional potentials that involve the side-chain centroids; these potentials
were determined based on the energy surfaces of terminally-blocked amino-acid residues
calculated with the semiempirical AM1 method4.
The structures of the target proteins were predicted by the following four-stage
procedure. First, UNRES was employed to carry out Multiplexed Replica Exchange Molecular
Dynamics (MREMD)5 for target proteins. To speed up the search for larger proteins, weak
restraints were imposed on secondary structure based on the secondary structure prediction by
PSIPRED6. Second, based on the MREMD simulation results, the Weighted-Histogram Analysis
Method (WHAM) was used to calculate the relative free energy of each structure from the last
section of the MREMD simulation1. Third, cluster analysis was employed to cluster the structures
from the MREMD simulation. The five clusters with the lowest free energies were chosen as
prediction candidates. Finally, in the fourth stage, the conformations closest to the
average structures of the respective clusters were converted to all-atom structures
using the PULCHRA7 and SCWRL8 algorithms. These all-atom structures were subsequently
refined by running 500 steps of energy minimization, 0.3 ps of molecular dynamics simulation
and, finally, 500 steps of minimization with the AMBER12 force field and the GBSA implicit
solvent model, with secondary-structure and positional restraints from the parent structures
added. The refined all-atom structures were submitted to the CASP website.
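
The third stage, choosing the clusters with the lowest free energies, can be sketched in Python. The sketch assumes that WHAM has already produced a statistical weight (probability) for every conformation at the prediction temperature and that the free energy of a cluster is -RT ln of the summed weights of its members; the weights, cluster labels, temperature and function names are hypothetical and do not come from the UNRES package.

```python
import math
from collections import defaultdict

R = 0.0019872  # gas constant in kcal/(mol K)

def cluster_free_energies(weights, labels, temperature=300.0):
    """F_c = -RT ln(sum of the weights of the conformations in cluster c)."""
    totals = defaultdict(float)
    for w, c in zip(weights, labels):
        totals[c] += w
    return {c: -R * temperature * math.log(t) for c, t in totals.items()}

def lowest_free_energy_clusters(weights, labels, n_best=5, temperature=300.0):
    """Cluster labels sorted by increasing free energy, truncated to n_best."""
    f = cluster_free_energies(weights, labels, temperature)
    return sorted(f, key=f.get)[:n_best]

# Toy ensemble: five conformations assigned to three clusters.
weights = [0.40, 0.10, 0.20, 0.05, 0.25]
labels = ["A", "A", "B", "C", "B"]
free_energies = cluster_free_energies(weights, labels)
candidates = lowest_free_energy_clusters(weights, labels, n_best=2)
```

A cluster therefore ranks well either because it contains a few very probable conformations or because it contains many moderately probable ones, which is the sense in which the ranking follows free energy rather than potential energy.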

Results
We postpone the assessment of the approach until the official release of CASP12 results.

Availability
The UNRES package is available at www.unres.pl.

1. Liwo,A., Czaplewski,C., Ołdziej,S., Rojas,A.V., Kaźmierkiewicz,R., Makowski,M.,
Murarka,R.K. & Scheraga,H.A. (2008) Simulation of protein structure and dynamics with
the coarse-grained UNRES force field. In: Coarse-Graining of Condensed Phase and
Biomolecular Systems, ed. G. Voth, Taylor & Francis, Chapter 8, pp. 107-122.
2. Liwo,A., Khalili,M., Czaplewski,C., Kalinowski,S., Ołdziej,S., Wachucik,K. &
Scheraga,H.A. (2007) Modification and optimization of the united-residue (UNRES)
potential energy function for canonical simulations. I. Temperature dependence of the
effective energy function and tests of the optimization method with single training proteins.
J. Phys. Chem. B 111, 260-285.
3. Zaborowski,B., Jagieła,D., Czaplewski,C., Hałabis,A., Lewandowska,A., Żmudzińska,W.,
Ołdziej,S., Karczyńska,A., Omieczynski,C., Wirecki,T. & Liwo,A. (2015) A maximum-
likelihood approach to force-field calibration. J. Chem. Inf. Model. 55, 2050-2070.
4. Sieradzan, A.K., Krupa,P., Scheraga,H.A., Liwo,A. & Czaplewski,C. (2015) Physics-based
potentials for the coupling between backbone- and side-chain-local conformational states in
the united residue (UNRES) force field for protein simulations. J. Chem. Theory Comput.,
11, 817-831.
5. Czaplewski,C., Kalinowski,S., Liwo,A. & Scheraga,H.A. (2009) Application of multiplexed
replica exchange molecular dynamics to the UNRES force field: Tests with α and α+β
proteins. J Chem. Theory Comput. 5, 627-640.

6. McGuffin,L.J., Bryson,K.& Jones,D.T. (2000) The PSIPRED protein structure prediction
server. Bioinformatics, 16, 404-405.
7. Rotkiewicz,P. & Skolnick,J. (2008) Fast procedure for reconstruction of full-atom protein
models from reduced representations. J. Comput. Chem., 29, 1460-1465.
8. Wang,Q., Canutescu,A.A. & Dunbrack,R.L. (2008) SCWRL and MolIDE: Computer
programs for side-chain conformation prediction and homology modeling. Nat. Protoc.
3,1832-1847.

UNRES (refinement)

All Atom Molecular Dynamics as a Refinement Tool

A. Gieldon, A. Karczynska, R. Slusarz, C. Czaplewski


Faculty of Chemistry, University of Gdańsk, Poland
artur.gieldon@ug.edu.pl

Coarse-grained models are very useful in protein structure prediction because the
computation time needed is dramatically reduced. However, the models obtained with
these methods have medium or low resolution because of the simplifications
inherent in them. In order to enhance the quality of the predicted structures and to
eliminate overlaps, which can appear during the united-atom to all-atom conversion, we
applied an AMBER1-based refinement protocol. An extended version of the refinement protocol
was also used for the refinement targets.

Methods
The obtained united-atom models were refined by using an AMBER1-based protocol. The
identified sections of defined secondary structure were restrained by adding parabolic
restraints on the φ and ψ angles with a force constant of 30 kcal/(mol·rad²). To keep the protein
shape as close as possible to the prediction obtained with the united-residue force field
(UNRES2), positional restraints of 2 kcal/(mol·Å²) on the Cα atoms were used. In the first step
the models were minimized (500 minimization steps). Next, 0.3 ps of Molecular Dynamics3 at
T = 40 K with implicit solvent treated by the Generalized Born model was used to perform a local
conformational search. Finally, the protein model was minimized as in the first step.
The idea for optimization of the TR refinement targets was similar. We decided to
perform minimization and low-temperature MD simulation alternately and to collect the resulting
structures as refinement models. The protocol is summarized in the following table:

Take the pure model Apply the most probable rotamers
1. 500 steps minimization, 2 kcal/(mol·Å²)
2. 5 ps, 60 K GB MD, 2 kcal/(mol·Å²)
3. 500 steps minimization
4. 5 ps, 80 K GB MD, 2 kcal/(mol·Å²)
5. 500 steps minimization
6. 5 ps, 120 K GB MD, 1 kcal/(mol·Å²)
7. 500 steps minimization -> COLLECT MODEL
8. 5 ps, 120 K GB MD, 1 kcal/(mol·Å²)
9. 500 steps minimization -> COLLECT MODEL
10. 5 ps, 180 K GB MD, 1 kcal/(mol·Å²)
11. 500 steps minimization -> COLLECT MODEL
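
The schedule in the table can be written down as data, which makes the totals easy to check. The Python sketch below only encodes the steps listed above; it does not run any simulation, and the class and field names are hypothetical. Where the table lists no restraint for a minimization step, the value is left unset rather than guessed.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Stage:
    kind: str                              # "min" or "md"
    steps: int = 0                         # minimization steps
    time_ps: float = 0.0                   # MD length in ps
    temperature_k: Optional[float] = None  # MD temperature in K
    restraint: Optional[float] = None      # kcal/(mol*A^2); None if not listed
    collect: bool = False                  # model collected after this stage

def build_protocol():
    """Return the 11 stages of the TR refinement schedule in order."""
    stages = [Stage("min", steps=500, restraint=2.0)]
    md_settings = [(60, 2.0, False), (80, 2.0, False),
                   (120, 1.0, True), (120, 1.0, True), (180, 1.0, True)]
    for temperature, restraint, collect in md_settings:
        stages.append(Stage("md", time_ps=5.0, temperature_k=temperature,
                            restraint=restraint))
        stages.append(Stage("min", steps=500, collect=collect))
    return stages

protocol = build_protocol()
total_md_ps = sum(stage.time_ps for stage in protocol)
n_collected = sum(stage.collect for stage in protocol)
```

Encoded this way, the schedule amounts to 25 ps of MD in total and yields three collected models per starting structure.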

Results
The aim in designing this protocol was to find a refinement procedure that gives good
results at low computational cost. The total simulation time for the largest protein does not
exceed 3 h on a single CPU. On the other hand, the structure to be optimized should be within the
radius of convergence. If the starting structure is far from the native structure, the
protocol will find only a local minimum.
Over all refined models, the average MolProbity4 score and clash score were 3.5 and 130 before
refinement and 2.0 and 0 after refinement, respectively.

1. Pearlman, D.A.; Case, D.A.; Caldwell, J.W.; Ross, W.S.; Cheatham, T.E. III; DeBolt, S.;
Ferguson, D.; Seibel, G.; Kollman, P. (1995) AMBER, a package of computer programs for
applying molecular mechanics, normal mode analysis, molecular dynamics and free energy
calculations to simulate the structural and energetic properties of molecules. Comp. Phys.
Commun., 91, 1–41.
2. Liwo, A., Oldziej, S., Czaplewski, C., Kleinerman, D.S., Blood, P., Scheraga, H.A. (2010)
Implementation of molecular dynamics and its extensions with the coarse-grained UNRES
force field on massively parallel systems; towards millisecond-scale simulations of protein
structure, dynamics, and thermodynamics. J. Chem. Theory Comput., 6, 890-909.
3. Dominy, B.N.; Brooks, C.L. III. (1999) Development of a generalized Born model
parameterization for proteins and nucleic acids. J. Phys. Chem. B, 103, 3765–3773.
4. Chen, V.B., Arendall III, W.B., Headd, J.J., Keedy, D.A., Immormino R.M., Kapral G.J.,
Murray, L.W., Richardson J.S., Richardson D.C. (2010) MolProbity: all-atom structure
validation for macromolecular crystallography. Acta Crystallographica D66: 12-21.

VoroMQA, VoroMQAsr, VoroMQA-select

Model Quality Assessment and Selection Using VoroMQA

K. Olechnovič1,2 and Č. Venclovas1


1 - Institute of Biotechnology, Vilnius University, 2 - Faculty of Mathematics and Informatics, Vilnius University
kliment@ibt.lt

We participated in CASP12 with two automated model accuracy estimation servers, VoroMQA
and VoroMQAsr, and a model selection method, VoroMQA-select, that was registered as a
regular tertiary structure prediction group. All our registered groups employed the latest version
of VoroMQA ("Voronoi diagram-based Model Quality Assessment"), a new method for the
estimation of protein structure quality that combines the idea of statistical potentials with the
advanced use of the Voronoi tessellation of atomic balls.

Methods
Given a protein structure, it can be represented as a set of atomic balls, each ball having a van
der Waals radius depending on the atom type. A ball can be assigned a region of space that
contains all the points that are closer (or equally close) to that ball than to any other. Such a
region is called a Voronoi cell and the partitioning of space into Voronoi cells is called a Voronoi
tessellation or a Voronoi diagram. Two adjacent Voronoi cells share a set of points that form a
surface called a Voronoi face. A Voronoi face can be viewed as a geometric representation of a
contact between two atoms. The Voronoi cells of atomic balls may be constrained inside the
boundaries defined by the solvent accessible surface (SAS) of the same balls. The procedure to
construct the described surfaces is implemented on top of the Voronota software1. It uses
triangulated representations of Voronoi faces and spherical surfaces; contact areas are calculated
as the areas of the corresponding triangulations.

Interatomic contact areas within the protein structure and those between the protein and
the solvent may be used to evaluate quality of protein structural models by employing the idea of
a knowledge-based statistical potential as was first shown in the work by McConkey et al.2
VoroMQA employs the same principle using more elaborate contact descriptions. We defined a
knowledge-based statistical potential function based on contact areas and defined a procedure of
transforming the pseudo-energy values calculated using the potential function into atom-level,
residue-level and full structure-level quality scores in the range from 0 to 1. The VoroMQA
scoring function was not optimized or trained in any way to correspond to any reference-based
model quality assessment scores. We applied only an unsupervised learning procedure using
experimentally determined structures of protein biological assemblies as input. Details on the
VoroMQA development and implementation will be available in an upcoming paper.

While the VoroMQA server group applied the VoroMQA method to input models directly,
the VoroMQAsr server first preprocessed input models by rebuilding their side-chains using
SCWRL43: this was done to level the effects of side-chain packing intricacies in different
structure prediction methods. Both VoroMQA and VoroMQAsr servers were tested on the
CASP11 data. Both methods generally performed better than the other tested model quality
assessment methods based mainly on knowledge-based statistical potentials with VoroMQAsr
performing slightly better than VoroMQA. Importantly, both our current methods performed
significantly better than the old simplified version of VoroMQA that participated in CASP11.

The VoroMQA-select method used VoroMQA, VoroMQAsr and GOAP4 scores to
perform a tournament-based ranking procedure for the server models submitted for every
CASP12 target and available from http://predictioncenter.org/download_area. The rank of each
model was based on the number of other models that had greater VoroMQA and VoroMQAsr
scores. The GOAP score was used as an independent judge when VoroMQA and VoroMQAsr
disagreed. This model selection scheme was tested on the CASP11 data: it performed slightly
better than just selecting models with best VoroMQA or VoroMQAsr scores. When running
VoroMQA-select during CASP12, the VoroMQA method was also used to determine if model
structures contained unstructured N- or C-terminal regions that could be removed prior to
evaluation. This procedure was not fully automatic and required manual intervention for deciding
the exact cutting locations.
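
The ranking logic can be illustrated with a short Python sketch. The abstract gives only the outline, so the details here are assumptions: every pair of models is compared once, a model loses a comparison when the other model has higher VoroMQA and VoroMQAsr scores, GOAP decides when the two disagree (with lower GOAP energy assumed to be better), and models are ranked by their number of losses. The model names and scores are invented.

```python
def tournament_rank(models):
    """Rank models by the number of pairwise comparisons they lose.

    `models` maps a name to a (voromqa, voromqasr, goap) tuple.
    """
    names = list(models)
    losses = {name: 0 for name in names}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            va, sa, ga = models[a]
            vb, sb, gb = models[b]
            if va > vb and sa > sb:
                loser = b
            elif vb > va and sb > sa:
                loser = a
            elif ga != gb:
                # The two VoroMQA variants disagree: GOAP acts as the judge.
                loser = b if ga < gb else a
            else:
                continue  # complete tie, nobody loses
            losses[loser] += 1
    return sorted(names, key=lambda name: losses[name])

# Toy server models: m1 and m2 split the VoroMQA votes, m3 is clearly worse.
models = {
    "m1": (0.50, 0.52, -9000.0),
    "m2": (0.48, 0.53, -9500.0),
    "m3": (0.40, 0.41, -8000.0),
}
ranking = tournament_rank(models)
```

In this toy case m1 has the better VoroMQA score and m2 the better VoroMQAsr score, so the GOAP energy breaks the tie in favour of m2.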

Availability
VoroMQA web application is available at http://bioinformatics.ibt.lt/wtsam/voromqa. VoroMQA
software for Linux is included in the Voronota package freely available from
http://bitbucket.org/kliment/voronota/downloads.

1. Olechnovic,K., and Venclovas,C. (2014) Voronota: a fast and reliable tool for computing the
vertices of the Voronoi diagram of atomic balls. J Comput Chem, 35(8), 672-681.
2. McConkey,B.J., Sobolev,V., and Edelman,M. (2003) Discrimination of native protein
structures using atom-atom contact scoring. Proc Natl Acad Sci USA, 100(6), 3215-3220.
3. Krivov,G.G., Shapovalov,M.V., and Dunbrack,R.L. (2009) Improved prediction of protein
side-chain conformations with SCWRL4. Proteins, 77(4), 778-795.
4. Zhou,H., and Skolnick,J. (2011) GOAP: a generalized orientation-dependent, all-atom
statistical potential for protein structure prediction. Biophys J, 101(8), 2043-2052.

Wallner

Combining ProQ2 and Pcons to improve model quality assessment, selection and
refinement

B. Wallner
Linköping University, S-581 83 Linköping, Sweden
bjornw@ifm.liu.se

In the past, we have tried numerous ways to combine a single-model MQAP, ProQ21,2, with a
consensus-based MQAP, Pcons3, but in the end a simple linear combination is as good as any
more advanced combination. This makes sense: consensus methods usually fail to
pick up the best models when there is no consensus, in which case the score is based mostly on the
single-model MQAP; when there is high consensus, the single-model MQAP selects
among the models that agree. For our predictions in CASP12 we used Pcomb=0.2*ProQ2+0.8*Pcons.
However, we did not limit ourselves to Pcomb but also manually inspected the top-ranked server
models by ProQ2 and Pcons. Below, we describe our combined approach in the TS and
TR categories.
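
As a minimal sketch of the linear combination above, the following computes Pcomb and ranks models by it. The function names and the input layout (a dictionary of per-model ProQ2 and Pcons global scores) are assumptions made for illustration; only the 0.2 and 0.8 weights come from the text.

```python
from typing import Dict, List, Tuple


def pcomb(proq2: float, pcons: float,
          w_proq2: float = 0.2, w_pcons: float = 0.8) -> float:
    """Linear combination of a single-model score and a consensus score."""
    return w_proq2 * proq2 + w_pcons * pcons


def rank_by_pcomb(scores: Dict[str, Tuple[float, float]]) -> List[str]:
    """Rank model names by Pcomb, best (highest) first.

    `scores` maps a model name to (ProQ2, Pcons); this layout is hypothetical.
    """
    return sorted(scores, key=lambda name: pcomb(*scores[name]), reverse=True)
```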

Methods
The Wallner group participated in the TS, TR and QA categories in CASP12. In the TS category
we used ProQ2, Pcons and Pcomb to assess the local and global quality of all server models. The
models, and in particular the per-residue predicted distance deviation of the globally top five
models by ProQ2, Pcons and Pcomb, were manually inspected. High-quality regions, i.e., regions
with low predicted distance deviation to native, were identified and restrained in a Rosetta relax
simulation using the predicted distance deviation. For each model, relaxed decoys were
generated for 384 hours (16*24h), resulting in 910 to 221,318 decoys per model depending on
protein size. A score that combined the Rosetta energy and ProQ2 with equal weights was
calculated (1*RosettaEnergy+(-1)*ProQ2; the weight is -1 because for ProQ2 higher is better). The five
models with the best (lowest) combined score were selected and the ProQ2 local quality prediction was
added to the B-factor column.
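
The decoy selection step can be sketched as below. This is an illustrative sketch, not the group's scripts: the `Decoy` tuple layout and function names are hypothetical, and the two terms are combined exactly as stated in the text (no rescaling is assumed).

```python
from typing import List, Tuple

# Hypothetical record: (decoy name, Rosetta energy, ProQ2 global score)
Decoy = Tuple[str, float, float]


def combined_score(rosetta_energy: float, proq2: float) -> float:
    """Equal-weight combination; ProQ2 enters with -1 because higher is better."""
    return 1.0 * rosetta_energy + (-1.0) * proq2


def select_best(decoys: List[Decoy], n: int = 5) -> List[Decoy]:
    """Return the n decoys with the lowest combined score."""
    return sorted(decoys, key=lambda d: combined_score(d[1], d[2]))[:n]
```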

For the TR category, a similar approach as for the TS category was used, except that we used the
refinement starting structure instead of models top ranked by ProQ2, Pcons or Pcomb. We
manually inspected the predicted distance deviation by ProQ2, Pcons and Pcomb and chose a
distance deviation cutoff that selected which part of the structure we should restrain. Decoys
were generated for 384h (16*24h) using the coordinate constraint option in the relax protocol in
Rosetta, resulting in 1,509 to 97,194 decoys depending on protein size. As above, we calculated a
combined Rosetta energy and ProQ2 score and selected the five best models according to this
score.

For the QA category the quality assessment is based on Pcomb only, i.e.,
Pcomb=0.2*ProQ2+0.8*Pcons. This is identical to what was used in CASP11.

Availability
ProQ2 is currently available as a scoring function in the Rosetta modeling
suite from http://www.rosettacommons.org. Additional scripts that generate all necessary input
files, together with command lines for running ProQ2-refine, can be found at
http://github.com/bjornwallner/ProQ_scripts/.

1. Ray, A., Lindahl, E. & Wallner, B. Improved model quality assessment using ProQ2. BMC
Bioinformatics 13, 224 (2012).
2. Uziela, K. & Wallner, B. ProQ2: estimation of model accuracy implemented in Rosetta.
Bioinformatics 32, 1411-1413 (2016).
3. Wallner, B. & Elofsson, A. Pcons5: combining consensus, structural evaluation and fold
recognition scores. Bioinformatics 21, (2005).

Wang1, Wang2, Wang3, Wang4 (QA)

Residue-Specific Quality Assessment of Individual Protein Models Using Ensembles of
Deep Networks, Support Vector Machines, and Random Forests

Tong Liu, Zheng Wang*


School of Computing, The University of Southern Mississippi
*Zheng.Wang@usm.edu

In this round of CASP, we redesigned four single-model prediction servers for quality assessment
(QA) of individual protein models by improving the models1 that we used in CASP11. First, we
introduced new features, including structural properties of three-dimensional protein models,
pseudo amino acid (PseAA) composition, and coevolution-based direct-coupling
analysis. Second, we used Random Forests (RFs) for the first time as one of our machine learning
algorithms; RFs also provide variable (feature) importance. Third, we developed four
prediction servers, each an ensemble with up to three levels built from Support Vector Machines
(SVMs), Stacked Denoising Autoencoders (SdAs), and Random Forests (RFs) with different
configurations.

Methods
The features used in our previous work1-3 were either kept or replaced with features computed by
newer tools. The major features we used in the CASP12 experiment include: (a) the difference of secondary structure
and relative solvent accessibility predicted from amino acid sequence and parsed from three-
dimensional protein models; (b) segment overlap scores (SOV) between predicted and parsed
secondary structures; (c) HHblits profile and coevolution-based direct-coupling analysis for
predicting residue-residue contacts in comparison with the contacts parsed from protein models;
(d) structural properties of three-dimensional protein models, including folding degree and radius
of gyration; (e) the analysis of pseudo amino acid composition of the target sequences. In
general, the features can be categorized into three groups: (1) local model features (f_l) extracted
from 15-residue sliding windows centered at the residue of interest; (2) global model features
(f_g) extracted from the entire protein model; (3) global prediction scores (f_gs), which are the
predicted global scores from intermediate machine learning models.
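
The local features are taken from 15-residue windows centered on each residue. A minimal sketch of such windowing is shown below; the function name and the zero-padding at the chain termini are assumptions for illustration, since the text does not say how window positions beyond the ends of the chain were handled.

```python
from typing import List, Sequence


def sliding_windows(values: Sequence[float], width: int = 15,
                    pad: float = 0.0) -> List[List[float]]:
    """Return one window of per-residue values centered on each residue.

    Positions that fall outside the chain are filled with `pad`
    (an assumption made for this sketch).
    """
    if width % 2 == 0:
        raise ValueError("width must be odd so that the window has a center")
    half = width // 2
    padded = [pad] * half + list(values) + [pad] * half
    return [padded[i:i + width] for i in range(len(values))]
```

For a chain of N residues this yields N windows of length 15, each of which could be flattened into the local feature vector f_l of its central residue.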
The training and cross-validation data sets were extracted from CASP 9. The predicted global
quality score of each protein model is a combination of all predicted residue deviations from
each of the four servers. Technically, we trained 16 machine learning models, including 10 SdAs,
three SVMs (SVM_1, SVM_2, SVM_3), and three RFs (RF_1, RF_2, RF_3), each model taking
fixed features as input. The ensemble details of our four QA prediction servers are shown in
Table 1:
(1) Wang1 uses the three categories of features and two SVMs (SVM_2 and SVM_3);
(2) Wang2 also uses the three classes of features and one SVM (SVM_1) and two RFs
(RF_1, RF_3);
(3) Wang3 excludes the features of global prediction scores and uses 10 SdAs and one
SVM (SVM_2);
(4) Wang4 also takes the three categories of features and uses 10 SdAs, one SVM
(SVM_1), and two RFs (RF_1, RF_2).

Table 1. Ensemble details of our four prediction servers (Wang1, Wang2, Wang3, and Wang4).
Server_ID  f_l  f_g  f_gs  10SdAs  SVM_1  RF_1  SVM_2  RF_2  SVM_3  RF_3
Wang1      √    √    √     -       -      -     √      -     √      -
Wang2      √    √    √     -       √      √     -      -     -      √
Wang3      √    √    -     √       -      -     √      -     -      -
Wang4      √    √    √     √       √      √     -      √     -      -

Availability
The four QA servers (Wang1, Wang2, Wang3, and Wang4) are available at
http://dna.cs.usm.edu/mass/. Our QA prediction results in the CASP12 experiment can be found at
http://dna.cs.usm.edu/mass/data/CASP12/.

1. Liu, T., Wang, Y., Eickholt, J. & Wang, Z. Benchmarking Deep Networks for Predicting
Residue-Specific Quality of Individual Protein Models in CASP11. Scientific reports 6,
19301 (2016).
2. Wang, Z., Eickholt, J. & Cheng, J. APOLLO: a quality assessment service for single and
multiple protein models. Bioinformatics 27, 1715-1716 (2011).
3. Cao, R., Wang, Z., Wang, Y. & Cheng, J. SMOQ: a tool for predicting the absolute residue-
specific quality of a single protein model with support vector machines. BMC Bioinformatics
15, 120 (2014).

Wang1, Wang2, Wang3, Wang4 (RR)

Protein Residue-Residue Contact Prediction Using Deep Learning and Direct-Coupling
Analysis

Joseph Bailey Luttrell IV1, Tong Liu1, Zheng Wang1, *


1 - School of Computing, University of Southern Mississippi, 118 College Drive, Hattiesburg, MS 39406-0001
*zheng.wang@usm.edu

Our four methods for performing residue-residue contact prediction each took different
approaches to learning from the same base set of training data. This shared data was generated
from a set of the target proteins from CASP 9. In the case of our ensemble learning method
presented in Wang2, this same data was used with more features added to it.

Methods
The feature generation procedure used here was the common factor shared by all four of our
prediction methods tested in CASP12. The training portion of this procedure is composed of a
pipeline that accepts the amino acid sequence of target proteins along with their corresponding
atomic coordinates file. This information was obtained and processed for 87 target proteins from
CASP 9. PSIPRED1 version 3.5 and ACCpro2 version 5.1 (a member of the SCRATCH 1.0
package) were used to predict the secondary structure and solvent accessibility of each residue in
the target proteins. These predictions, along with information from the sequence itself, were
encoded into each example in the training set. A pair of sliding windows each centered at one of
the two residues in a potential contact pairing was used to pick the residues to compare. These
window center residues were always at least 6 residues apart in the sequence and were each
assigned a binary target value that was determined by measuring their Euclidean distance in the
target protein's atomic coordinate file. Positive examples in the training data were composed of
these residue pairs with alpha carbons that were 8 Å or less apart.
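
The labeling rule above (window centers at least 6 residues apart, positive if the alpha carbons are within 8 Å) can be sketched as follows. The function name and the input format, a list of C-alpha coordinates in sequence order, are assumptions for illustration; parsing of the coordinate file is omitted.

```python
import math
from typing import Dict, List, Tuple

Coord = Tuple[float, float, float]


def contact_labels(ca_coords: List[Coord], min_separation: int = 6,
                   cutoff: float = 8.0) -> Dict[Tuple[int, int], int]:
    """Assign a binary contact label to every residue pair (i, j), i < j.

    Only pairs at least `min_separation` residues apart in sequence are
    considered; a pair is positive (1) when the Euclidean distance between
    its alpha carbons is `cutoff` angstroms or less.
    """
    labels = {}
    n = len(ca_coords)
    for i in range(n):
        for j in range(i + min_separation, n):
            distance = math.dist(ca_coords[i], ca_coords[j])
            labels[(i, j)] = 1 if distance <= cutoff else 0
    return labels
```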
The final step in preparing the training dataset for usage was to deal with the large
imbalance of positive and negative examples by balancing the dataset. This was done by keeping
every positive example and randomly selecting a matching number of negative examples from
the much larger pool of negative examples. When this procedure is used to make predictions on
new targets, almost everything remains the same. However, the only input is the amino acid
sequence of the target protein. Also, the full set of examples picked up by the sliding window
system is evaluated since no balancing takes place during the prediction process.
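
The balancing step can be sketched as below. This is a minimal illustration, assuming each training example is a `(features, label)` pair with label 1 for contacts and 0 otherwise; the function name and the fixed random seed are hypothetical choices made for reproducibility.

```python
import random
from typing import List, Sequence, Tuple

Example = Tuple[Sequence[float], int]


def balance(examples: List[Example], seed: int = 0) -> List[Example]:
    """Keep every positive example and an equal number of random negatives."""
    positives = [e for e in examples if e[1] == 1]
    negatives = [e for e in examples if e[1] == 0]
    rng = random.Random(seed)
    # Sample without replacement from the much larger pool of negatives.
    sampled = rng.sample(negatives, min(len(positives), len(negatives)))
    balanced = positives + sampled
    rng.shuffle(balanced)
    return balanced
```

At prediction time no such subsampling is applied, so every residue pair produced by the sliding windows is scored.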
After the training dataset had been generated, it was split into five pieces of equal size in
a way that maintained the balance of positive and negative examples. Our method presented in
Wang3 was trained on this data using an implementation of support vector machines called
SVM_light3. This training process involved performing many five-fold cross-validations with
different learning parameters to search for models with the best performance. After selecting just
one model, all predictions for Wang3 were made by using that model to classify the features
generated for each target protein. The predictions were then sorted by their confidence value, and
the highest scoring predicted contacts were reported for the final results of our Wang3 method.
The only difference in the training and prediction process of our learning ensemble used in
Wang2 was the inclusion of features obtained from contact predictions made by an
implementation of direct coupling analysis4 (DCA) for each residue pair selected by the sliding
window system.
These DCA predictions were made with a C++ implementation of the direct coupling
analysis methods introduced by Morcos et al.4. First, HHblits5 was used to generate an MSA
(multiple sequence alignment) against the HMM (Hidden Markov Model) database
"uniprot20_2015_06". With this MSA, a sequence profile and a contact map were obtained for every
possible contact in the target protein. Since no training was needed to use this method, the
resulting predictions from the contact map were simply sorted with the predicted contacts having
the highest confidence values listed first. This process was performed directly on the CASP12
targets and the results were reported in our Wang4 method. For our Wang2 method, this process
was carried out on the training set of CASP 9 proteins and the resulting contact map predictions
were added as features back into the training data to create a learning ensemble before the final
SVM model was trained.
Our final contact prediction method used stacked denoising autoencoders based on
Theano to form a consensus of multiple deep networks, as we have implemented before6; 7. Here,
the same base training data as used in Wang2 (before the DCA features were added) and Wang3
was split into four pieces instead of five. These pieces were used in different combinations as
pre-training, training, validation, and test data. Multiple four-fold cross-validations were
performed with different learning parameters, and 20 of the best-performing models were selected
to participate in a consensus. Predictions were made for each target protein in CASP12 by using
each of the 20 models and averaging their results. The averaged results were then sorted by
confidence value, and the predicted contacts with the highest scores were reported as the
predictions for our Wang1 method.
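
The consensus step for Wang1 amounts to averaging the per-pair confidences of the selected networks and sorting. A minimal sketch follows; it assumes each model's output is a dictionary from residue pair to confidence and that all models scored the same pairs. The function name and data layout are hypothetical.

```python
from typing import Dict, List, Tuple

Pair = Tuple[int, int]


def consensus_contacts(model_predictions: List[Dict[Pair, float]],
                       top_n: int) -> List[Tuple[Pair, float]]:
    """Average per-pair confidences over all models and return the top pairs."""
    n_models = len(model_predictions)
    pairs = model_predictions[0].keys()
    averaged = {
        pair: sum(pred[pair] for pred in model_predictions) / n_models
        for pair in pairs
    }
    # Highest averaged confidence first.
    ranked = sorted(averaged.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]
```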

1. McGuffin, L. J., Bryson, K. & Jones, D. T. (2000) The PSIPRED protein structure prediction
server. Bioinformatics 16, 404–405.
2. Cheng, J., Randall, A. Z., Sweredoski, M. J. & Baldi, P. (2005) SCRATCH: a protein
structure and structural feature prediction server. Nucleic Acids Res. 33, W72–W76.
3. Joachims, T. (1999) Making large-scale support vector machine learning practical. in
Advances in kernel methods 169–184 MIT Press.
4. Morcos, F. et al. (2011) Direct-coupling analysis of residue coevolution captures native
contacts across many protein families. Proc. Natl. Acad. Sci. 108, E1293–E1301.
5. Remmert, M., Biegert, A., Hauser, A. & Söding, J. (2012) HHblits: lightning-fast iterative
protein sequence searching by HMM-HMM alignment. Nat. Methods 9, 173–175.
6. Liu, T., Wang, Y., Eickholt, J. & Wang, Z. (2016) Benchmarking Deep Networks for
Predicting Residue-Specific Quality of Individual Protein Models in CASP11. Sci. Rep. 6,
19301.
7. Wang, Y. et al. (2016) Predicting DNA Methylation State of CpG Dinucleotide Using
Genome Topological Features and Deep Networks. Sci. Rep. 6, 19598.

wfAll-Cheng

Tertiary Structure Prediction by wfAll-Cheng

Jie Hou1, Badri Adhikari1, Renzhi Cao2, Silvia Crivelli3, and Jianlin Cheng1*
1 - Department of Computer Science, University of Missouri, Columbia, MO 65211, USA, 2 - Department of
Computer Science, Pacific Lutheran University, WA 98447, USA, 3 - Lawrence Berkeley National Laboratory,
Berkeley, CA 94720, USA
*chengji@missouri.edu

WeFold1 is a community-wide cooperation project for protein structure prediction within CASP.
As part of the WeFold collaborative, we evaluated all of the WeFold models generated by other
WeFold branches and selected 5 top ranked models, followed by model refinement.

Methods
Our wfAll-Cheng server first collected all of the WeFold models and CASP12 server models
for each target. Models that were highly similar to others within the same group, as well as
non-full-length models, were then filtered out of the model pool. All the full-length models were
evaluated by a full pairwise model comparison method, APOLLO2, and a single-model quality
assessment method, Qprob3. APOLLO calculates the average pairwise GDT-TS score between each
model and the other structures in the model pool and reports it as the global quality score. Qprob
generates a predicted GDT-TS score for each model, and all the models are ranked by this score
from high to low. The wfAll-Cheng server averaged the two rankings of each model, one from the
APOLLO score and one from the Qprob score, to generate a consensus ranking of all models. There
were two exceptions. When the highest APOLLO GDT-TS score was less than 0.2 (e.g., a likely
free-modeling target), only Qprob ranks were considered to select the final top models, since
Qprob has shown good performance in assessing the quality of models of hard targets3. Conversely,
when the top APOLLO GDT-TS score was larger than 0.8 (e.g., an easy target), we used only the
APOLLO score to select the final models.
The selected top 5 models were refined by ModRefiner4 to improve the global and local
structures. Finally, the local quality scores by ModFOLDclustQ5 were added into models before
their submission to CASP.
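
The ranking rule described above can be sketched as follows. This is an illustration under stated assumptions rather than the server's code: the function names are hypothetical, the inputs are dictionaries of per-model APOLLO and Qprob scores on a 0 to 1 GDT-TS scale, and ties are broken by input order.

```python
from typing import Dict, List


def ranks(scores: Dict[str, float]) -> Dict[str, int]:
    """Rank models by score, with rank 1 for the highest score."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: rank for rank, name in enumerate(ordered, start=1)}


def select_models(apollo: Dict[str, float], qprob: Dict[str, float],
                  n: int = 5, low: float = 0.2, high: float = 0.8) -> List[str]:
    """Select the top n models by consensus rank, with hard/easy exceptions."""
    r_apollo, r_qprob = ranks(apollo), ranks(qprob)
    top_apollo = max(apollo.values())
    if top_apollo < low:
        # Likely free-modeling target: trust the single-model method only.
        key = lambda m: r_qprob[m]
    elif top_apollo > high:
        # Easy target: trust the pairwise comparison only.
        key = lambda m: r_apollo[m]
    else:
        # Otherwise average the two ranks.
        key = lambda m: (r_apollo[m] + r_qprob[m]) / 2
    return sorted(apollo, key=key)[:n]
```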

Results
We evaluated wfAll-Cheng along with CASP12 server predictors on the 11 human targets whose
experimental structures have been released to date. The sum of Z-scores of the first (i.e., TS1)
models predicted by these predictors for the 11 targets is reported in Table 1. The Z-score of a
model was calculated as the difference between the model's GDT-TS score and the average GDT-TS
score of all the models in the model pool of a target, divided by the standard deviation of those
GDT-TS scores. A negative Z-score is converted to 0 when summing the Z-scores of a predictor.
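
The summed Z-score can be sketched as below. The function names and the data layout (one dictionary of GDT-TS scores per target, keyed by predictor) are assumptions for illustration, and the population standard deviation is used because the text does not say whether the population or sample form was applied.

```python
from statistics import mean, pstdev
from typing import Dict, List


def z_scores(gdt_by_model: Dict[str, float]) -> Dict[str, float]:
    """Z-score of each model within the model pool of one target."""
    values = list(gdt_by_model.values())
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return {model: 0.0 for model in gdt_by_model}
    return {model: (gdt - mu) / sigma for model, gdt in gdt_by_model.items()}


def sum_of_z(targets: List[Dict[str, float]], predictor: str) -> float:
    """Sum a predictor's Z-scores over targets, counting negative values as 0."""
    total = 0.0
    for pool in targets:
        if predictor in pool:
            total += max(0.0, z_scores(pool)[predictor])
    return total
```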

Table 1. The top 10 predictors ranked by the sum of their Z-scores. wfAll-Cheng and
MULTICOM (for details, see our CASP12 abstract entitled “Tertiary Structure
Prediction by the MULTICOM Human Group”) were human predictors, while all others were
server predictors. The 11 targets are T0859, T0862, T0863, T0864, T0868, T0869, T0870,
T0872, T0900, T0904 and T0944.

RANK  Predictor name          Sum of Z-scores
1     wfAll-Cheng (Human)     17.05
2     BAKER-ROSETTASERVER     16.53
3     MULTICOM (Human)        16.29
4     GOAL                    11.04
5     Zhang-Server            8.95
6     QUARK                   7.88
7     MULTICOM-NOVEL          7.589
8     MULTICOM-CLUSTER        6.94
9     RaptorX                 6.15
10    RaptorX-Contact         5.87

1. Khoury, G. A. et al. WeFold: a coopetition for protein structure prediction. Proteins:
Structure, Function, and Bioinformatics 82, 1850-1868 (2014).
2. Wang, Z. APOLLO: a quality assessment service for single and multiple protein models.
Bioinformatics 27, doi:10.1093/bioinformatics/btr268 (2011).
3. Cao, R. & Cheng, J. Protein single-model quality assessment by feature-based probability
density functions. Scientific reports 6 (2016).
4. Xu, D. & Zhang, Y. Improving the physical realism and structural accuracy of protein
models by a two-step atomic-level energy minimization. Biophysical journal 101, 2525-2534
(2011).
5. McGuffin, L. & Roche, D. Rapid model quality assessment for protein structure predictions
using the comparison of multiple models without structural alignments. Bioinformatics 26
(2010).

wf-BAKER-UNRES

Prediction of protein structure with the UNRES force field aided by contact- and
secondary-structure prediction derived from evolutionarily related proteins

A.G. Lipska1, M.D. Wiśniewska2, M.A. Mozolewska1, P. Krupa1, R. Ślusarz1, M. Ślusarz1,
A. Liwo1, S. Ovchinnikov3, D. Baker3 and S.N. Crivelli4*
1 - Faculty of Chemistry, University of Gdańsk, Wita Stwosza 63, 80-308 Gdańsk, Poland, 2 - Centre of New
Technologies, University of Warsaw, Banacha 2c Str., 02-097 Warsaw, Poland, 3 - Department of Biochemistry, and
Molecular and Cellular Biology Program, University of Washington, Seattle, WA 98195, 4 - Department of
Computer Science, UC Davis, One Shields Ave., Davis, CA 95616
sncrivelli@ucdavis.edu

We applied an approach similar to that of the wfCPUNK group, in which secondary-structure and
contact-prediction information was used to aid the conformational search with the physics-based
united-residue (UNRES)1 force field. In this branch, however, secondary-structure predictions
were obtained from PsiPred and contact prediction was carried out with GREMLIN.

Methods
In the UNRES model1, a polypeptide chain is represented by a sequence of alpha-carbon atoms
connected by virtual bonds with attached side chains. Two interaction sites are used to represent
each amino acid: the united peptide group (p), located midway between two consecutive
alpha-carbon atoms, and the united side chain (SC). The interactions in this simplified model are
described by the UNRES potential, derived from the generalized cluster-cumulant expansion of a
restricted free energy (RFE) function of polypeptide chains. The cumulant expansion enabled us
to determine the functional forms of the multibody terms in UNRES. The effective energy
function depends on temperature and has been parameterized to reproduce the structure and
thermodynamics of selected training proteins2,3. The prediction procedure involved a restrained
conformational search by Multiplexed Replica Exchange Molecular Dynamics (MREMD)4 with
the UNRES force field. The Weighted-Histogram Analysis Method (WHAM) was then
used to calculate the relative free energy of each structure in the last slice of the MREMD
simulation3, and a cluster analysis was employed to cluster these structures. The five clusters
with the lowest free energies were chosen as prediction candidates. The
conformations closest to the average structures of these clusters
were converted to all-atom structures and then refined by short restrained MD runs
with the AMBER12 force field to give the models that were submitted.
Contact prediction, from which the restraints were derived, was carried out with the
GREMLIN5 method. GREMLIN works by constructing a global statistical model that
simultaneously captures the conservation and co-evolution patterns in the input multiple
sequence alignment. The alignments were generated using HHblits6 and Jackhmmer7 with
varying e-values and numbers of iterations. Strongly co-evolving residue pairs identified by this
approach were used as restraints in modeling.

Results

We postpone the assessment of the approach until the official release of CASP12 results.

Availability
The UNRES package is available at www.unres.pl. GREMLIN is available at
http://gremlin.bakerlab.org.

1. Liwo,A., Czaplewski,C., Ołdziej,S., Rojas,A.V., Kaźmierkiewicz,R., Makowski,M.,
Murarka, R.K. & Scheraga,H.A. (2008) Simulation of protein structure and dynamics with
the coarse-grained UNRES force field. In: Coarse-Graining of Condensed Phase and
Biomolecular Systems, ed. G. Voth, Taylor & Francis, Chapter 8, pp. 107-122.
2. Liwo,A., Khalili,M., Czaplewski,C., Kalinowski,S., Ołdziej,S., Wachucik,K. &
Scheraga,H.A. (2007) Modification and optimization of the united-residue (UNRES)
potential energy function for canonical simulations. I. Temperature dependence of the
effective energy function and tests of the optimization method with single training proteins.
J. Phys. Chem. B 111, 260-285.
3. Zaborowski,B., Jagieła,D., Czaplewski,C., Hałabis,A., Lewandowska,A., Żmudzińska,W.,
Ołdziej,S., Karczyńska,A., Omieczynski,C., Wirecki,T. & Liwo,A. (2015) A maximum-
likelihood approach to force-field calibration. J. Chem. Inf. Model. 55, 2050-2070.
4. Czaplewski,C., Kalinowski,S., Liwo,A. & Scheraga,H.A. (2009) Application of multiplexed
replica exchange molecular dynamics to the UNRES force field: Tests with α and α+β
proteins. J Chem. Theory Comput. 5, 627-640.
5. Kamisetty,H., Ovchinnikov,S & Baker D. (2013) Assessing the utility of coevolution-based
residue-residue contact predictions in a sequence- and structure-rich era. Proc. Natl. Acad.
Sci. U.S.A., 110, 16674-16679
6. Remmert,M., Biegert,A., Hauser,A. & Söding,J. (2011) HHblits: lightning-fast iterative
protein sequence searching by HMM-HMM alignment. Nat Methods, 9, 173-175.
7. Eddy,S.R. (2009) A new generation of homology search tools based on probabilistic
inference. Genome Inform., 23, 205-211.

wfCPUNK

Prediction of protein structure with the UNRES force field aided by secondary-structure
and contact prediction

A.G. Lipska1, P. Krupa2, M.A. Mozolewska2, A.K. Sieradzan1, C. Kieslich3, M. Onel3, U. Shah3,
Y. He2, Y. Yin2, Ł. Golon1, R. Ślusarz1, M. Ślusarz1, A. Liwo1, C.A. Floudas3, H.A. Scheraga2
and S.N. Crivelli4*
1 - Faculty of Chemistry, University of Gdańsk, Wita Stwosza 63, 80-308 Gdańsk, Poland, 2 -Baker Laboratory of
Chemistry and Chemical Biology, Cornell University, Ithaca, NY, 14853-1301, U.S.A., 3 – Artie McFerrin
Department of Chemical Engineering, Texas A&M University, College Station, TX 77843-3372; USA, 4 –
Department of Computer Science, UC Davis, One Shields Ave., Davis, CA 95616, USA
sncrivelli@ucdavis.edu

Physics-based methods for protein-structure prediction are attractive because of their
independence of structural databases; however, the accuracy of the existing force fields is not yet
good enough for these methods to be used in a standalone mode. As part of the WeFold project,
in CASP12 we explored the use of contact-prediction and secondary-structure-prediction
information together with the coarse-grained physics-based united-residue (UNRES) model for
polypeptide chains1 in simulations. Owing to the simplicity of UNRES, it can perform de novo
simulations of the folding of large proteins and it is, at the same time, closely connected to the
physics of interactions.

Methods
In the UNRES model1, a polypeptide chain is represented by a sequence of alpha-carbon atoms
connected by virtual bonds with attached united side chains and united peptide groups positioned
halfway between the consecutive C(alpha) atoms. The UNRES force field used in the current
CASP experiment was optimized by using the maximum-likelihood approach to force field
calibration that we developed recently2.
In the first step of the prediction, Multiplexed Replica Exchange Molecular Dynamics
(MREMD)3 simulations were carried out with the coarse-grained UNRES force field, using
dihedral-angle and distance restraints determined as described below. The conformational
ensemble was determined with the aid of the weighted histogram analysis method (WHAM) and
dissected into 5 clusters whose average conformations constituted the candidate predictions in
the coarse-grained representation4. The obtained coarse-grained structures were converted to all-
atom structures and refined by doing short MD runs with the AMBER12 force field.
Secondary structure prediction was performed using conSSert5, a consensus SVM model
that utilizes as features the probabilities for coil, helix, and strand as predicted by PSSPRED6,
PSIPRED7, RAPTORX8, and SPINE-X9. The secondary structure predictions from conSSert
were used to search a structural template library, using a modified implementation of HHsuite10.
The prediction of tertiary contacts was based on tertiary contacts extracted from the structural
templates identified by conSSert/HHsuite, and two types of contact predictions were generated:
general tertiary contacts and beta-sheet topology contacts. For the prediction of general tertiary
C(beta) contacts, Delaunay triangulation was applied to identify C(beta) contacts in each
identified structural template, and a consensus score based on the template probability from
HHsuite was used to rank observed contacts. The beta-sheet topology method takes as input
secondary structure and template-based tertiary contact predictions and utilizes two mixed-integer
linear programming (MILP) models: (i) a MILP model for strand pair alignment, and (ii) a MILP
model for beta-sheet topology that utilizes constraints to provide physically relevant topologies.

Results
We postpone the assessment of the approach until the official release of CASP12 results.

Availability
The UNRES package is available at www.unres.pl.

1. Liwo,A., Czaplewski,C., Ołdziej,S., Rojas,A.V., Kaźmierkiewicz,R., Makowski,M.,
Murarka, R.K. & Scheraga,H.A. (2008) Simulation of protein structure and dynamics with
the coarse-grained UNRES force field. In: Coarse-Graining of Condensed Phase and
Biomolecular Systems, ed. G. Voth, Taylor & Francis, Chapter 8, pp. 107-122.
2. Zaborowski,B., Jagieła,D., Czaplewski,C., Hałabis,A., Lewandowska,A., Żmudzińska,W.,
Ołdziej,S., Karczyńska,A., Omieczynski,C., Wirecki,T. & Liwo,A. (2015) A maximum-
likelihood approach to force-field calibration. J. Chem. Inf. Model. 55, 2050-2070.
3. Czaplewski,C., Kalinowski,S., Liwo,A. & Scheraga,H.A. (2009) Application of multiplexed
replica exchange molecular dynamics to the UNRES force field: Tests with α and α+β
proteins. J Chem. Theory Comput. 5, 627-640.
4. Liwo,A., Khalili,M., Czaplewski,C., Kalinowski,S., Ołdziej,S., Wachucik,K. &
Scheraga,H.A. (2007) Modification and optimization of the united-residue (UNRES)
potential energy function for canonical simulations. I. Temperature dependence of the
effective energy function and tests of the optimization method with single training proteins.
J. Phys. Chem. B 111, 260-285.
5. Kieslich,C.A., Smadbeck,J., Khoury,G.A. & Floudas,C.A. (2015) conSSert: Consensus SVM
models for accurate prediction of ordered secondary structure. J. Chem. Inf. Modeling, 56,
455-461.
6. Zhang, Y. http://zhanglab.ccmb.med.umich.edu/PSSpred.
7. McGuffin, L.J., Bryson, K. & Jones, D.T. (2000) The PSIPRED protein structure prediction
server. Bioinformatics, 16, 404-405.
8. Wang,Z., Zhao,F., Peng,J. & Xu,J. (2011) Protein 8-class secondary structure prediction
using conditional neural fields. Proteomics, 11, 3786-3792.
9. Faraggi,E., Zhang,T., Yang,Y., Kurgan,L. & Zhou,Y. (2012) SPINE X: Improving protein
secondary structure prediction by multi-step learning coupled with prediction of solvent
accessible surface area and backbone torsion angles. J Comp Chem, 33, 259-67.
10. Söding,J. Protein homology detection by HMM-HMM comparison. Bioinformatics 2005, 21:
951-960.

wfDB_BW_SVGroup

Assessing Protein Structure Models using Protein Structure Networks [PSN-QA]

B.K. Dhanasekaran1, Sambit Ghosh1, Saraswathi Vishveshwara1 & the WeFold Community2
1- Molecular Biophysics Unit, Indian Institute of Science, Bangalore-560012, India
2-https://wefold.nersc.gov/wordpress/
sv@mbu.iisc.ernet.in

WeFold is an open collaboration initiative for protein structure prediction within CASP. It brings
together labs and individuals through the science gateway http://wefold.nersc.gov/. WeFold
enables the interaction among groups that work on different components of the protein structure
prediction pipeline. The combination of these components creates hybrid protein structure
prediction pipelines, each submitting its own models. In its third round, the collaboration resulted
in 12 different pipelines. Here we describe the wfDB_BW_SVGroup pipeline, which combines
the David Baker group from the University of Washington, Seattle, WA, United States of America,
the Björn Wallner group from Linköping University, Linköping, Sweden, and the SVGroup from the
Indian Institute of Science, Bangalore, India.

Methods
Decoy generation:
Decoy sets were generated from the fully automated structure prediction server, Robetta1 (see
BAKER-ROSETTASERVER abstract for details). Given the target sequence, Robetta first
predicts domain boundaries by identifying PDB templates with optimal sequence similarity and
structural coverage to the target through an iterative process. For each iteration, HHSearch2,
Sparks3, and RaptorX4 are used to identify templates and generate alignments. The target
sequence is threaded onto the template structures to generate partial-threaded models and the
regions of the target sequence that are not covered by the partial-threads are passed on to the next
search iteration. Through this iterative process, non-overlapping clusters are identified that,
together, cover the full length of the target sequence and domain boundaries are assigned. For
each predicted domain, Robetta uses the Rosetta comparative modeling protocol, RosettaCM5, to
recombine structural elements from the clustered partial-threads and model missing segments.
Conformational sampling is performed using the Rosetta low-resolution score function6 with
spatial restraints that are generated separately from each cluster7. If enough co-evolutionary
sequence data exist to accurately predict residue-residue contacts using GREMLIN8, the clusters
are re-ranked using this information, and the spatial restraints are supplemented with the
predicted contacts. All models are refined using a relax protocol9 that minimizes the Rosetta
full-atom energy10 in torsion and Cartesian space to allow bond angle flexibility.

Filtering:
The accuracy of all generated domain models was estimated using ProQ211, which was recently
implemented as a scoring function in Rosetta12, and the top 1,000 for each domain were selected.
In total, 32,474,636 domain models were scored using ProQ2 in CASP12.

Ranking:
The Protein Structure Network Quality Assessment tool (PSN-QA) was used to rank the final set of
models. The PSN-QA tool is based on a graph-theoretical approach in which a protein structure
is treated as a network. The amino acids of the structure are the nodes, and edges are
constructed between nodes based on the non-covalent interaction strength13-14 between their
side-chain atoms. The interaction strength (Iij) is based on the number of interacting atoms
between a pair of amino acids. Network properties such as the size of the largest cluster
(SLClu), the largest K-2 communities (ComSk2) and the clustering coefficients capture the
general pattern exhibited by native proteins. Notably, native protein structures consistently
display a steeper transition profile as a function of the interaction strength cut-off (Imin)
than modeled structures do (Fig1).

Fig1: Transition profile of network parameters for native and decoy structures.
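
One of the network parameters named above, the size of the largest cluster as a function of Imin, can be sketched as follows. This is a minimal illustration and not the PSN-QA implementation. It assumes the pairwise interaction strengths Iij have already been computed, treats a cluster as a connected component, and uses hypothetical function names.

```python
def largest_cluster_size(n_residues, strengths, i_min):
    """Size of the largest connected component of the residue network when
    only edges with interaction strength >= i_min are kept.

    strengths: dict mapping (i, j) residue-index pairs to a precomputed
    interaction strength I_ij (assumed input).
    """
    parent = list(range(n_residues))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (i, j), strength in strengths.items():
        if strength >= i_min:
            root_i, root_j = find(i), find(j)
            if root_i != root_j:
                parent[root_i] = root_j

    sizes = {}
    for node in range(n_residues):
        root = find(node)
        sizes[root] = sizes.get(root, 0) + 1
    return max(sizes.values()) if sizes else 0


def transition_profile(n_residues, strengths, cutoffs):
    """Largest-cluster size at each interaction-strength cut-off."""
    return [(c, largest_cluster_size(n_residues, strengths, c)) for c in cutoffs]
```

Plotting the output of `transition_profile` for a native structure and for a decoy would give curves of the kind summarized in Fig1.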

This transition profile is a characteristic property of native protein structures and can be
used to differentiate good models, which show native-like properties, from non-native-like
structures. Across different Imin values, these parameters, together with the main-chain
hydrogen bonds, total 94 parameters for a single protein structure. An SVM model was trained on
these 94 parameters, with 5422 native structures from the PDB forming the positive set and 29543
decoy models from various sources forming the negative set15-16. The tool classifies a given
protein structure model as good or bad and also provides a score for it.
The scores lie between -30 and 20 and are interpreted as follows:

<14: Bad models.
14-16: Transition zone. Models may be classified as either good or bad.
16-20: Good models. Models show native-like properties.
A group of very similar structures can be ranked based on their scores. PSN-QA was used to
obtain the ranks of all group and server models in the regular category from the CASP12
website. It was also used to classify and rank the 1000 models, which were produced by the
wfDB_BW_SVGroup pipeline.
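
The score bands and the score-based ranking described above can be illustrated with a short sketch. The handling of the boundary values is an assumption (14 is placed in the transition zone and 16 in the good band), since the text does not say which band the boundaries belong to, and the function names are hypothetical.

```python
def classify_psn_qa_score(score):
    """Map a PSN-QA score to the qualitative bands listed in the text.

    Assumption: exactly 14 counts as transition zone, exactly 16 as good.
    """
    if score < 14:
        return "bad"
    if score < 16:
        return "transition"
    return "good"


def rank_by_score(scores):
    """Return model names ordered from highest to lowest score.

    scores: dict mapping model name to its score.
    """
    return sorted(scores, key=scores.get, reverse=True)
```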

Availability.
PSN-QA is available at http://vishgraph.mbu.iisc.ac.in/GraProStr/PSN-QA.html

1. Kim,D.E., Chivian,D., & Baker,D. (2004). Protein structure prediction and analysis using the
Robetta server. Nucleic Acids Res 32, W526-W531.
2. Söding,J. (2005). Protein homology detection by HMM-HMM comparison. Bioinformatics 21
(7), 951-960.
3. Yang,Y., Faraggi,E., Zhao,H. & Zhou,Y. (2011). Improving protein fold recognition and
template-based modeling by employing probabilistic-based matching between predicted one-
dimensional structural properties of the query and corresponding native properties of
templates. Bioinformatics 27 (15), 2076-2082.
4. Peng,J. & Xu,J. (2011). Raptorx: Exploiting structure information for protein alignment by
statistical inference. Proteins 79, 161-171.
5. Song,Y., DiMaio,F., Wang,R., Kim,D., Miles,C., Brunette,T., Thompson,J., & Baker,D.
(2013). High-resolution comparative modeling with RosettaCM. Structure 21 (10), 1735-
1742.
6. Leaver-Fay,A., Tyka,M., Lewis,S., Lange,O.F., Thompson,J., Jacak,R., Kaufman,K.,
Renfrew,P.D., Smith,C., Sheffler,W., Davis,I., Cooper,S., Treuille,A., Mandell,D., Richter,F.,
Ban,Y.A., Fleishman,S., Corn,J., Kim,D.E., Lyskov,S., Berrondo,M., Mentzer,S., Popović,Z.,
Havranek,J., Karanicolas,J., Das,R., Meiler,J., Kortemme,T., Gray,J.J., Kuhlman,B., Baker,D.
& Bradley,P. (2010). ROSETTA3.0: An Object-Oriented Software Suite for the Simulation
and Design of Macromolecules. Methods in Enzymology 487, 545-574.
7. Thompson,J. & Baker,D. (2011). Incorporation of evolutionary information into Rosetta
comparative modeling. Proteins 79 (8), 2380-2388.
8. Kamisetty,H., Ovchinnikov,S. & Baker,D. (2013). Assessing the utility of coevolution-based
residue–residue contact predictions in a sequence- and structure-rich era. PNAS 110 (39)
15674-15679.
9. Conway,P., Tyka,M.D., DiMaio,F., Konerding, D.E. & Baker,D. (2014). Relaxation of
backbone bond geometry improves protein energy landscape modeling. Protein Sci. 23 (1),
47-55.
10. Tyka, M.D., Keedy, D.A., André, I., Dimaio, F., Song, Y., Richardson, D.C., Richardson, J.S.,
and Baker, D. (2011). Alternate states of proteins revealed by detailed energy landscape
mapping. J. Mol. Biol. 405, 607–618.
11. Ray A, Lindahl E, Wallner B. (2012). Improved model quality assessment using ProQ2. BMC
Bioinformatics 13, 224.
12. Uziela K, Wallner B. (2016). ProQ2: estimation of model accuracy implemented in Rosetta.
Bioinformatics. 32 (9), 1411-3
13. Kannan N, Vishveshwara S (1999) Identification of side-chain clusters in protein structures by
a graph spectral method. J Mol Biol 292: 441–464.
14. Brinda K, Vishveshwara S: A network representation of protein structures: implications for
protein stability. Biophys J. 2005; 89(6): 4159–4170.
15. Chatterjee S, Ghosh S, Vishveshwara S: Network properties of decoys and CASP predicted
models: A comparison with native protein structures. Mol Biosyst. 2013; 9(7): 1774–1788.
16. Ghosh S and Vishveshwara S (2014) Ranking the quality of protein structure models using
sidechain based network properties [v1; ref status: indexed http://f1000r.es/2eu]
F1000Research 2014, 3:17 (doi:10.12688/f1000research.3-17).

wfMESHI-Seok

WeFold – the wfMESHI-Seok branch

T. Sidi1, C. Keasar1, Lim Heo2, Gyu Rie Lee2, Minkyung Baek2, Chaok Seok2 and Silvia
Crivelli3
1 - Ben-Gurion University of the Negev, 2 - Department of Chemistry, Seoul National University, Seoul, 3- Lawrence
Berkeley National Laboratory, 1 Cyclotron Road, Berkeley
sncrivelli@lbl.gov

WeFold is an open collaboration initiative for protein structure prediction within CASP. It brings
together labs and individuals through the science gateway http://wefold.nersc.gov/ and provides
computing and storage resources through the National Energy Research Scientific Computing
(NERSC) center. WeFold enables the interaction among groups that work on different
components of the protein structure prediction pipeline, thus making it possible to leverage
expertise at a scale that has not been done before. The combination of these components creates
hybrid protein structure prediction pipelines, each submitting its own models. This collaboration
aims to promote a synergistic effect among the participants and ultimately produce better results
than those achieved by the individual methods. In its third round, the collaboration resulted in 12
different pipelines. Here we describe the wfMESHI-Seok branch, which combines decoys of
CASP servers, scoring by MESHI_Score and refinement and selection by GalaxyRefine.

Methods
In the first stage of this WeFold branch, the MESHI group downloaded server decoys from the
CASP web site, and scored them. To this end we used practically the same protocol as
MESHI_SERVER, applying it to complete decoy sets (the T0XXX.3D.srv.tar.gz tarballs), and
uploading the list of scored decoys to the WeFold site. The MESHI_SERVER protocol is
described in its own abstract. In a nutshell, it first standardizes the decoys by scwrl4 2 rotamer
optimization followed by energy minimization. Then, the protocol extracts 106 structural
features from each decoy and feeds them to MESHI_Score3, an ensemble of a thousand
independent predictors. Each of these predictors is trained to predict decoy qualities using a
unique subset of the features. The final score is the weighted median of the thousand individual
scores.
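
The final aggregation step above, a weighted median over the individual predictor scores, can be sketched as follows. This is an illustration only: how MESHI_Score assigns the weights is not described here, so the weights are treated as given inputs, and the lower weighted median is used as one common convention.

```python
def weighted_median(values, weights):
    """Lower weighted median: the smallest value at which the cumulative
    weight reaches half of the total weight.

    values, weights: equal-length sequences; weights are assumed to be
    non-negative with a positive sum.
    """
    if not values or len(values) != len(weights):
        raise ValueError("values and weights must be non-empty and equal in length")
    half = sum(weights) / 2.0
    cumulative = 0.0
    for value, weight in sorted(zip(values, weights)):
        cumulative += weight
        if cumulative >= half:
            return value
    return max(values)  # reached only through floating-point round-off
```

Compared with a weighted mean, a median of this kind is less sensitive to a few predictors that give extreme scores.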
The number of server models passed on for refinement was reduced to 48 by taking the models
with the highest MESHI scores that are structurally distinct, with mutual TM-scores lower than 0.95.
The representative 48 models were refined by using GalaxyRefine1. This refinement method
involves repetitive short molecular dynamics relaxations after perturbing the structure by
sidechain repacking. Backbone conformational change is driven by sidechain repacking in this
way. The energy function used by GalaxyRefine is a hybrid of molecular mechanics-based
energy components and knowledge-based components. In CASP12, a newly developed
knowledge-based potential that considers solvation states of interacting atoms as well as their
distances replaced the dDFIRE potential in the previous energy function used in CASP11. The
refined models were ranked by the new knowledge-based potential, and the top five models were
submitted as final predictions.
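
The selection of structurally distinct, high-scoring models described above can be sketched as a greedy filter. This is a hypothetical illustration, not the code used by the group: the similarity function is passed in (for example a TM-score wrapper), and reading the cutoff as 0.95 on the 0 to 1 TM-score scale is an assumption.

```python
def select_distinct(models, scores, similarity, max_models=48, cutoff=0.95):
    """Greedily pick the best-scoring models that are mutually distinct.

    models: list of model identifiers.
    scores: dict mapping identifier to score (higher is better).
    similarity: callable (a, b) -> structural similarity between two models.
    A model is kept only if its similarity to every kept model is below cutoff.
    """
    ranked = sorted(models, key=lambda m: scores[m], reverse=True)
    selected = []
    for model in ranked:
        if all(similarity(model, kept) < cutoff for kept in selected):
            selected.append(model)
            if len(selected) == max_models:
                break
    return selected
```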

1. Heo,L., Park,H. & Seok,C. (2013) GalaxyRefine: Protein structure refinement driven by side-
chain repacking. Nucleic Acids Res. 41 (W1), W384-W388.
2. Krivov,G.G. et al. (2009) Improved prediction of protein side-chain conformations with
SCWRL4. Proteins, 77, 778–795.
3. Mirzaei,S. et al. (2016) Purely Structural Protein Scoring Functions Using Support Vector
Machine and Ensemble Learning. IEEE/ACM Transactions on Computational Biology and
Bioinformatics, in press.

wfMESHI-TIGRESS

WeFold – the wfMESHI-TIGRESS branch

Tomer Sidi1, Chen Keasar1, Melis Onel2, Utkarsh Shah2, Chris Kieslich2, Christodoulos A.
Floudas2, and S.N. Crivelli3
1 - Ben-Gurion University of the Negev, 2 – Texas A&M University, 3 - Lawrence Berkeley National Laboratory,
USA

WeFold1 is an open collaboration initiative for protein structure prediction within CASP. It brings
together labs and individuals through the science gateway http://wefold.nersc.gov/ and provides
computing and storage resources through the National Energy Research Scientific Computing
(NERSC) center. WeFold enables the interaction among groups that work on different
components of the protein structure prediction pipeline, thus making it possible to leverage
expertise at a scale that has not been done before. The combination of these components creates
hybrid protein structure prediction pipelines, each submitting its own models. This collaboration
aims to promote a synergistic effect among the participants and ultimately produce better results
than those achieved by the individual methods. In its third round, the collaboration resulted in 12
different pipelines. Here we describe the wfMESHI-TIGRESS branch, which combines decoys
of CASP servers, scoring by MESHI_Score and refinement by Princeton_TIGRESS2.

Methods

In the first stage of this WeFold branch, the MESHI group downloaded server decoys from the
CASP web site, and scored them. To this end we used practically the same protocol as
MESHI_SERVER, applying it to complete decoy sets (the T0XXX.3D.srv.tar.gz tarballs), and
uploading the list of scored decoys to the WeFold site. The MESHI_SERVER protocol is
described in its own abstract. In a nutshell, it first standardizes the decoys by scwrl4 3 rotamer
optimization followed by energy minimization. Then, the protocol extracts 106 structural
features from each decoy and feeds them to MESHI_Score4, an ensemble of a thousand
independent predictors. Each of these predictors is trained to predict decoy qualities using a
unique subset of the features. The final score is the weighted median of the thousand individual
scores.

For the second stage of wfMESHI-TIGRESS, the FLOUDAS group applied protein refinement
via Princeton_TIGRESS to the top 5 unique decoys identified by the MESHI_SERVER. The
decoys were submitted to the Princeton_TIGRESS webserver, following the same procedure
utilized by FLOUDAS_REFINESERVER. Princeton_TIGRESS utilizes a strategy consisting of
separate sampling and selection stages, with sampling involving CYANA5 torsion angle
dynamics and Rosetta FastRelax6, and selection based on an SVM predictor. The SVM model
includes a decomposition of physics-based and hybrid energy functions, as well as a geometry-free
representation of the protein structure, obtained by binning Cα-Cα distances, to capture
fine-grained movements. Following selection, CHARMM7 molecular dynamics simulations are
utilized for further refinement. A more elaborate description of the refinement protocol can be
found in the abstract of FLOUDAS_REFINESERVER. The protocol was followed consistently
with no manual intervention.
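
A representation built from binned Cα-Cα distances, as mentioned above, might look like the sketch below. It is only an illustration of the general idea and not the Princeton_TIGRESS feature definition; the bin edges and the function name are hypothetical.

```python
import math
from bisect import bisect_right
from itertools import combinations


def ca_distance_histogram(ca_coords, bin_edges):
    """Count pairwise C-alpha distances falling into each distance bin.

    ca_coords: list of (x, y, z) tuples, one per residue.
    bin_edges: ascending edges; bin k covers [bin_edges[k], bin_edges[k + 1]).
    Distances outside the covered range are ignored.
    """
    counts = [0] * (len(bin_edges) - 1)
    for a, b in combinations(ca_coords, 2):
        index = bisect_right(bin_edges, math.dist(a, b)) - 1
        if 0 <= index < len(counts):
            counts[index] += 1
    return counts
```

Because only inter-atomic distances enter the histogram, the feature vector does not change when the structure is rotated or translated.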

Availability
• The Princeton_TIGRESS refinement webserver is available at
http://atlas.engr.tamu.edu/refinement/.
• The MESHI software package (version 9.29, which was used in CASP12) is available at:
https://www.dropbox.com/sh/mb31bjdvvydhuzh/AADVcclTZKtFiSl6I9hBx8Dxa?dl=0

1. Khoury, G. A., Liwo, A., Khatib, F., Zhou, H., Chopra, G., Bacardit, J., Bortot, L., Delbem,
A. C. B., Deng, X., Faccioli, R., He, Y., Krupa, P., Li, J., Mozolewska, M., Baker, D.,
Cheng, J., Floudas, C. A., Keasar, C., Levitt, M., Popović, Z., Scheraga, H. A., Skolnick, J.,
Crivelli, S. N. & Foldit Players (2014). WeFold: A Coopetition for Protein Structure Prediction.
Proteins: Structure, Function, and Bioinformatics 82, 1850-1868.
2. Khoury, G. A., Tamamis, P., Pinnaduwage, N., Smadbeck, J., Kieslich, C. A. & Floudas,
C.A. (2014). Princeton_TIGRESS: Protein geometry refinement using simulations and
support vector machines. Proteins: Structure, Function, and Bioinformatics 82, 794-814.
3. Krivov,G.G. et al. (2009) Improved prediction of protein side-chain conformations with
SCWRL4. Proteins, 77, 778–795.
4. Mirzaei,S. et al. (2016) Purely Structural Protein Scoring Functions Using Support Vector
Machine and Ensemble Learning. IEEE/ACM Transactions on Computational Biology and
Bioinformatics, in press.
5. Güntert, P. (2004). Automated NMR structure calculation with CYANA. Methods in
Molecular Biology 278, 353-378.
6. Leaver-Fay, A., Tyka, M., Lewis, S. M., Lange, O. F., Thompson, J., Jacak, R., Kaufman, K.,
Renfrew, P. D., Smith, C. A., Sheffler, W., Davis, I. W., Cooper, S., Treuille, A., Mandell, D.
J., Richter, F., Ban, Y. E., Fleishman, S. J., Corn, J. E., Kim, D. E., Lyskov, S., Berrondo, M.,
Mentzer, S., Popović, Z., Havranek, J. J., Karanicolas, J., Das, R., Meiler, J., Kortemme, T.,
Gray, J. J., Kuhlman, B., Baker, D. & Bradley, P. (2011). ROSETTA3: an object-oriented
software suite for the simulation and design of macromolecules. Methods in Enzymology
487, 545-574.
7. MacKerell, A. D., Jr., Brooks, B., Brooks, C. L., III, Nilsson, L., Roux, B., Won, Y., and
Karplus, M. (1998). CHARMM: The Energy Function and Its Parameterization with an
Overview of the Program. In The Encyclopedia of Computational Chemistry (Schleyer, P. v. R.
et al., eds.), Vol. 1, pp. 271-277. John Wiley & Sons: Chichester.

wfRosetta-MUfold

Protein structure refinements by using MUfold to sample Rosetta decoys

Hongbo Li1,2 and Dong Xu2


1-School of Computer Science and Information Technology, NorthEast Normal University, Changchun, 130117,
China, 2-Department of Computer Science and Christopher S. Bond Life Sciences Center, University of Missouri,
Columbia, Missouri, 65211, USA.
xudong@missouri.edu

A number of effective tools for protein structure prediction, such as I-TASSER, Modeller and
Rosetta, have been developed. These tools often generate many models for a given target protein
sequence. There is significant room to improve prediction accuracy by better sampling, ranking
and combining these models. In this work, we propose a method to refine protein structures by
sampling models generated by the Rosetta server. We assemble similar models from the Rosetta
server and use them as templates to generate new models based on MUfold[1]. We have used a
QA strategy which combines a consensus method and some single-model scores to select the
good candidates from the new models.

Methods
For each target, more than 10,000 models were generated by the Rosetta server[2]. The quality of
these models was estimated using ProQ2[3], and the top 1000 were used as the input to our
method. The output is 5 refined models, obtained through the following 4 steps:
Step 1. Filtering redundant models from the initial pool of 1000 models: We calculate the
pairwise GDT-TS between the input models and remove redundant models with a filtering cutoff
of 0.95.
Step 2. Selecting good starting models: We use affinity propagation[4] to cluster the
filtered input models. We select the representatives of up to the top 30 clusters and, sorted by
cluster size, use them as the starting models.
Step 3. Selecting good candidates from the new models by MUfold-QA: For each
starting model sm, we find those models that are most similar to sm and use them (including sm)
as templates to build new models by MUfold and Modeller[5]. A scoring function SC in MUfold-
QA is defined to measure the quality of new models. According to SC, if a new model nm is
better than sm, then we use nm as the current sm and repeat the procedure. The iteration ends when
no new model is better than the current sm, and the candidate is the model with the highest SC
score.
Step 4. Ranking good candidates from the new models by a QA strategy: We use the SC
score to sort all the candidates and the best 5 models will be the final output. The filtered input
decoys are used as a reference set ref. Given a predicted model mi, the scoring function is defined
as follows:
$$SC(m_i) = 3\,\mathrm{norm}(\mathrm{cus}(m_i)) + \mathrm{norm}(\mathrm{dfire}(m_i)) + \mathrm{norm}(\mathrm{opusca}(m_i)) + \mathrm{norm}(\mathrm{cheng}(m_i))$$
where norm is a normalization to the range 0-1,
$$\mathrm{norm}(s_i) = \frac{s_i - s_{min}}{s_{max} - s_{min}},$$
and smax and smin are the maximum and minimum of score s over the models in ref. As for the
consensus score, we do not
calculate the pairwise similarity between the models, because we generate thousands of models
and calculating the pairwise similarity to get a consensus score is very time consuming. Instead,
it is calculated using MUfold-CL[6]: For each model, we calculate the distance between every
pair of Cα atoms, and these distances compose a vector for this model. We first generate the
vectors for all the filtered input models, and then we calculate their centroid vector. The
consensus score of a model mi is defined by Dscore1(Di, Dc)[6], where Di is the vector of mi
and Dc is the centroid vector. The scoring functions dfire, opusca and cheng are defined in [7,
8, 9], respectively.
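
The SC scoring function of Step 4 can be sketched as below, assuming the four component scores have already been computed for the model and for every model in the reference set. The dictionary layout and function names are hypothetical, and the sketch assumes each component is oriented so that higher means better, which the text does not state.

```python
def min_max(value, reference):
    """Min-max normalization of value against a reference set of scores."""
    lo, hi = min(reference), max(reference)
    if hi == lo:
        return 0.0  # degenerate reference set; the choice of 0.0 is arbitrary
    return (value - lo) / (hi - lo)


def sc_score(model_terms, reference_terms):
    """SC = 3*norm(cus) + norm(dfire) + norm(opusca) + norm(cheng).

    model_terms: dict mapping term name to the model's raw score.
    reference_terms: dict mapping term name to the list of raw scores of
    the reference models (the filtered input decoys).
    """
    weights = {"cus": 3.0, "dfire": 1.0, "opusca": 1.0, "cheng": 1.0}
    return sum(
        weight * min_max(model_terms[name], reference_terms[name])
        for name, weight in weights.items()
    )
```

Because normalization uses the reference set only, a new model that scores outside the reference range gets a normalized term below 0 or above 1; whether the original method clips such values is not stated.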

1. Zhang J., Wang Q., Barz B., He Z., Kosztin I., Shang Y. and Xu D. (2010). MUFOLD: A
new solution for protein 3D structure prediction. Proteins. 78, 1137-1152.
2. Kim, D.E., Chivian, D. and Baker, D. (2004). Protein structure prediction and analysis using
the Robetta server. Nucleic Acids Res. 32, W526-W531.
3. Ray A., Lindahl E. and Wallner B. (2012). Improved model quality assessment using ProQ2.
BMC Bioinformatics, 13, 224.
4. Frey B. J. and Dueck D. (2007). Clustering by Passing Messages Between Data Points.
Science. 315, 972–976.
5. Sali A., Blundell T. L. (1993). Comparative protein modelling by satisfaction of spatial
restraints. J Mol Biol. 234, 779–815.
6. Zhang J. and Xu D. (2013). Fast algorithm for population-based protein structural model
analysis. Proteomics. 13, 221–229.
7. Zhou H., Zhou Y. (2002). Distance-scaled, finite ideal-gas reference state improves structure-
derived potentials of mean force for structure selection and stability prediction. Protein Sci.
11, 2714–2726.
8. Wu Y., Lu M., Chen M., Li J., Ma J. (2007). OPUS-Ca: a knowledge-based potential function
requiring only Cα positions. Protein Sci. 16, 1449–1463.
9. Wang Z., Tegge A., Cheng J. (2009). Evaluating the absolute quality of a single protein
model using structural features and support vector machines. Proteins: Struct Funct
Bioinformatics 75, 638–647.

wfRosetta-ProQ-MESHI

WeFold – the wfRosetta-ProQ-MESHI pipeline

T. Sidi1, C. Keasar1, D. E. Kim2, D. Baker2, B. Wallner3 and S.N. Crivelli4


1– Department of Computer Science, Ben-Gurion University of the Negev, Israel, 2 - Institute for Protein Design,
University of Washington, USA, 3 - Division of Bioinformatics, Department of Physics, Chemistry and Biology,
Linköping University, Sweden, 4–Lawrence Berkeley National Laboratory, USA
chen@cs.bgu.ac.il

WeFold is an open collaboration initiative for protein structure prediction within CASP. It brings
together researchers through the science gateway http://wefold.nersc.gov/ and provides
computing and storage resources through the National Energy Research Scientific Computing
center. WeFold enables the interaction among groups that work on different components of the
protein structure prediction pipeline. The combination of these components creates hybrid protein
structure prediction pipelines, each submitting its own models. This collaboration aims to
promote a synergistic effect among the participants and ultimately produce better results than
those achieved by the individual methods. In its third round, the collaboration resulted in 12
different pipelines. Here we describe the wfRosetta-ProQ-MESHI pipeline, which applied a two
stage selection process to domain decoys generated by Robetta.

Methods
Decoy generation: Decoy sets were generated from the Robetta server1 (see BAKER-
ROSETTASERVER abstract for details). Given the target sequence, Robetta first predicts
domain boundaries by identifying PDB templates with optimal sequence similarity and structural
coverage to the target through an iterative process. For each iteration, HHSearch2, Sparks3, and
RaptorX4 are used to identify templates and generate alignments. The target sequence is threaded
onto the template structures to generate partial-threaded models, which are then clustered to
identify distinct topologies that are ranked based on the likelihood of the alignments. Regions of
the target sequence that are not covered by the partial-threads or are not similar in structure
within the top ranked cluster are passed on to the next search iteration. Through this iterative
process, non-overlapping clusters are identified that, together, cover the full length of the target
sequence and domain boundaries are assigned at the transitions between the clusters. The
modeling difficulty of each domain is determined by the degree of structural consensus between
the top ranked partial threads from each alignment method. For each predicted domain, Robetta
uses the Rosetta comparative modeling protocol, RosettaCM5, which recombines structural
elements from the clustered partial-threads and models missing segments using a combination of
fragment insertion and mixed torsion-Cartesian space minimization. Conformational sampling is
performed using the Rosetta low-resolution score function6 with spatial restraints that are
generated separately from each cluster7. If enough co-evolutionary sequence data exists to
accurately predict residue-residue contacts using GREMLIN8, the clusters are re-ranked using
this information, and the spatial restraints are supplemented with the predicted contacts. For
difficult domains, models are also generated using the Rosetta fragment assembly methodology6
(RosettaAbinitio), and if GREMLIN contacts are predicted, they are used as restraints for
sampling and refinement. All models are refined using a relax protocol9 that minimizes the
Rosetta full-atom energy10 in torsion and Cartesian space to allow bond angle flexibility. Up to
four RosettaCM decoy sets were generated for each domain depending on the number of
template clusters and all RosettaCM models were provided. Two RosettaAbinitio decoy sets
were provided for each difficult domain, the top 5 percent all-atom Rosetta scoring models and
the top 15 percent contact order11 scoring models. Large scale sampling of around 10,000 to
300,000 models was possible through the use of the distributed computing project,
Rosetta@home.
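
For readers unfamiliar with the contact order measure cited above11, the sketch below computes the relative contact order, the mean sequence separation of contacting residue pairs divided by the chain length. How Robetta defines contacts and uses the value to pick the top 15 percent is not described here, so the contact list is simply taken as input and the function name is hypothetical.

```python
def relative_contact_order(contacts, n_residues):
    """Relative contact order: the mean sequence separation |i - j| over
    all contacts, divided by the number of residues.

    contacts: list of (i, j) residue-index pairs that are in contact
    (the contact definition is left to the caller).
    """
    if not contacts:
        return 0.0
    total_separation = sum(abs(i - j) for i, j in contacts)
    return total_separation / (n_residues * len(contacts))
```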

Filtering: The accuracy of all generated domain models was estimated using ProQ212, which was
recently implemented as a scoring function in Rosetta13, and the top 1,000 models for each
domain were selected. In total, 32,474,636 domain models were scored using ProQ2 in CASP12.

Final selection and model submission: The top 1000 comparative modeling and top 1000 ab initio
models (the latter, when available) for each predicted domain were downloaded from the WeFold
server, and fed to the MESHI_Server protocol. This protocol is described in the
MESHI_SERVER abstract. In a nutshell, it first standardizes the decoys by scwrl4 14 rotamer
optimization followed by energy minimization. Then, the protocol extracts 106 structural
features from each decoy and feeds them to MESHI_Score, an ensemble of a thousand
independent predictors. Each of these predictors is trained to predict decoy qualities using a
unique subset of the features. The final score is the weighted median of the thousand individual
scores. For each domain, we submitted the five standardized decoys with the highest MESHI_Score.

Availability: The MESHI software package (version 9.29, which was used in CASP12) is
available at:
https://www.dropbox.com/sh/mb31bjdvvydhuzh/AADVcclTZKtFiSl6I9hBx8Dxa?dl=0

1. Kim,D.E., Chivian,D., & Baker,D. (2004). Protein structure prediction and analysis using the
Robetta server. Nucleic Acids Res 32, W526-W531.
2. Söding,J. (2005). Protein homology detection by HMM-HMM comparison. Bioinformatics 21
(7), 951-960.
3. Yang,Y., Faraggi,E., Zhao,H. & Zhou,Y. (2011). Improving protein fold recognition and
template-based modeling by employing probabilistic-based matching between predicted one-
dimensional structural properties of the query and corresponding native properties of
templates. Bioinformatics 27 (15), 2076-2082.
4. Peng,J. & Xu,J. (2011). Raptorx: Exploiting structure information for protein alignment by
statistical inference. Proteins 79, 161-171.
5. Song,Y. et al. (2013). High-resolution comparative modeling with RosettaCM. Structure 21
(10), 1735-1742.
6. Leaver-Fay,A. et al. (2010). ROSETTA3.0: An Object-Oriented Software Suite for the
Simulation and Design of Macromolecules. Methods in Enzymology 487, 545-574.
7. Thompson,J. & Baker,D. (2011). Incorporation of evolutionary information into Rosetta
comparative modeling. Proteins 79 (8), 2380-2388.
8. Kamisetty,H., Ovchinnikov,S. & Baker,D. (2013). Assessing the utility of coevolution-based
residue–residue contact predictions in a sequence- and structure-rich era. PNAS 110 (39)
15674-15679.
9. Conway,P., Tyka,M.D., DiMaio,F., Konerding, D.E. & Baker,D. (2014). Relaxation of
backbone bond geometry improves protein energy landscape modeling. Protein Sci. 23 (1),
47-55.
10. Tyka,M.D. et al. (2011). Alternate states of proteins revealed by detailed energy landscape
mapping. JMB 405, 607–18.
11. Plaxco,K.W., Simons,K.T., Baker,D. (1998). Contact order, transition state placement and the
refolding rates of single domain proteins. J. Mol. Biol. 277 (4) 985–994.
12. Ray,A., Lindahl,E, Wallner,B. (2012). Improved model quality assessment using ProQ2.
BMC Bioinformatics 13, 224.
13. Uziela,K, Wallner,B. (2016). ProQ2: estimation of model accuracy implemented in Rosetta.
Bioinformatics 32 (9), 1411-3
14. Krivov,G.G et al. (2009) Improved prediction of protein side-chain conformations with
SCWRL4. Proteins, 77, 778–795.

wfRosetta-ProQ-ModF6

Selection of Rosetta decoys using ProQ2 and the ModFOLD6 server

L.J. McGuffin1, A.H.A. Maghrabi1, D.E. Kim2, D. Baker2, B. Wallner3, and S.N. Crivelli4
1 - School of Biological Sciences, University of Reading, Reading, UK, 2 - Dept. of Biochemistry, University of
Washington, USA, 3 - Division of Bioinformatics, Department of Physics, Chemistry, and Biology, Linköping
University, Sweden, 4 - Computational Research Division, Lawrence Berkeley National Laboratory, USA
l.j.mcguffin@reading.ac.uk

WeFold is an open collaboration initiative for protein structure prediction within CASP. It brings
together labs and individuals through the science gateway http://wefold.nersc.gov/ and provides
computing and storage resources through the National Energy Research Scientific Computing
(NERSC) center. WeFold enables the interaction among groups that work on different
components of the protein structure prediction pipeline, thus making it possible to leverage
expertise at a scale that has not been done before. The combination of these components creates
hybrid protein structure prediction pipelines, each submitting its own models. This collaboration
aims to promote a synergistic effect among the participants and ultimately produce better results
than those achieved by the individual methods. In its third round, the collaboration resulted in 12
different pipelines. Here we describe the wfRosetta-ProQ-ModF6 pipeline, which combines
output from Rosetta, ProQ2 and ModFOLD6.

Methods

Decoy generation:
Decoy sets were generated from the fully automated structure prediction server, Robetta1 (see
BAKER-ROSETTASERVER abstract for details). Given the target sequence, Robetta first
predicts domain boundaries by identifying PDB templates with optimal sequence similarity and
structural coverage to the target through an iterative process. For each iteration, HHSearch2,
Sparks3, and RaptorX4 are used to identify templates and generate alignments. The target
sequence is threaded onto the template structures to generate partial-threaded models, which are
then clustered to identify distinct topologies that are ranked based on the likelihood of the
alignments. Regions of the target sequence that are not covered by the partial-threads or are not
similar in structure within the top ranked cluster are passed on to the next search iteration.
Through this iterative process, non-overlapping clusters are identified that, together, cover the
full length of the target sequence and domain boundaries are assigned at the transitions between
the clusters. The modeling difficulty of each domain is determined by the degree of structural
consensus between the top ranked partial threads from each alignment method. For each
predicted domain, Robetta uses the Rosetta comparative modeling protocol, RosettaCM5, which
recombines structural elements from the clustered partial-threads and models missing segments
using a combination of fragment insertion and mixed torsion-Cartesian space minimization.
Conformational sampling is performed using the Rosetta low-resolution score function6 with
spatial restraints that are generated separately from each cluster7. If enough co-evolutionary
sequence data exists to accurately predict residue-residue contacts using GREMLIN8, the
clusters are re-ranked using this information, and the spatial restraints are supplemented with the
predicted contacts. For difficult domains, models are also generated using the Rosetta fragment
assembly methodology6 (RosettaAbinitio), and if GREMLIN contacts are predicted, they are
used as restraints for sampling and refinement. All models are refined using a relax protocol9
that minimizes the Rosetta full-atom energy10 in torsion and Cartesian space to allow bond angle
flexibility. Up to four RosettaCM decoy sets were generated for each domain depending on the
number of template clusters and all RosettaCM models were provided. Two RosettaAbinitio
decoy sets were provided for each difficult domain, the top 5 percent all-atom Rosetta scoring
models and the top 15 percent contact order11 scoring models. Large scale sampling of around
10,000 to 300,000 models was possible through the use of the distributed computing project,
Rosetta@home.

Filtering:
The accuracy of all generated domain models was estimated using ProQ212, which was recently
implemented as a scoring function in Rosetta13, and the top 1,000 models for each domain were
selected. In total, 32,474,636 domain models were scored using ProQ2 in CASP12.

Model quality assessment, domain recombination and final model selection:


The top 250 ProQ2 filtered models for each domain were submitted for ranking using the
ModFOLD6_rank option on our latest version of the ModFOLD server14,15,16 (see our
ModFOLD6 abstract for details). If a target had more than one predicted domain, then the top 5
ranked models for each domain were recombined using the domain_assembly script obtained
from the Cheng group (Jie Hou, personal communication), which implements MODELLER17.
The final set of full-chain models was then scored and re-ranked using ModFOLD6_rank.
The per-residue error estimates were added to the B-factor column in each model file and the top
5 ranked models were submitted.
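
The last step, storing per-residue error estimates in the B-factor column, can be sketched as follows. This is a minimal illustration and not the ModFOLD6 code: the function name and the dictionary keyed by (chain, residue number) are assumptions made for the example. It relies only on the fixed-column PDB layout, in which the temperature factor occupies columns 61-66 of ATOM and HETATM records; insertion codes are ignored for brevity.

```python
def write_error_to_b_factor(pdb_lines, per_residue_error):
    """Return PDB lines with per-residue error estimates in the B-factor field.

    pdb_lines: lines of a PDB file without trailing newlines (assumed well-formed).
    per_residue_error: dict mapping (chain_id, residue_number) -> float
                       (hypothetical input format chosen for this sketch).
    """
    updated = []
    for line in pdb_lines:
        if line.startswith(("ATOM", "HETATM")):
            # Chain ID is column 22, residue number is columns 23-26 (1-based).
            key = (line[21], int(line[22:26]))
            if key in per_residue_error:
                padded = line.ljust(80)
                # The B-factor field is columns 61-66, written as %6.2f.
                b_field = f"{per_residue_error[key]:6.2f}"
                line = padded[:60] + b_field + padded[66:]
        updated.append(line)
    return updated
```

Records for residues without an estimate are passed through unchanged, and values of 1000 or more would overflow the six-character field, so a real implementation would need to cap them.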

Availability
Rosetta is available via: https://www.rosettacommons.org/software/
ProQ2 is available via: https://github.com/bjornwallner/ProQ_scripts
The ModFOLD6 server is available at:
http://www.reading.ac.uk/bioinf/ModFOLD/ModFOLD6_form.html

1. Kim,D.E., Chivian,D., & Baker,D. (2004). Protein structure prediction and analysis using the
Robetta server. Nucleic Acids Res 32, W526-W531.
2. Söding,J. (2005). Protein homology detection by HMM-HMM comparison. Bioinformatics
21 (7), 951-960.
3. Yang,Y., Faraggi,E., Zhao,H. & Zhou,Y. (2011). Improving protein fold recognition and
template-based modeling by employing probabilistic-based matching between predicted one-
dimensional structural properties of the query and corresponding native properties of
templates. Bioinformatics 27 (15), 2076-2082.
4. Peng,J. & Xu,J. (2011). Raptorx: Exploiting structure information for protein alignment by
statistical inference. Proteins 79, 161-171.
5. Song,Y., DiMaio,F., Wang,R., Kim,D., Miles,C., Brunette,T., Thompson,J., & Baker,D.
(2013). High-resolution comparative modeling with RosettaCM. Structure 21 (10), 1735-
1742.
6. Leaver-Fay,A., Tyka,M., Lewis,S., Lange,O.F., Thompson,J., Jacak,R., Kaufman,K.,
Renfrew,P.D., Smith,C., Sheffler,W., Davis,I., Cooper,S., Treuille,A., Mandell,D., Richter,F.,
Ban,Y.A., Fleishman,S., Corn,J., Kim,D.E., Lyskov,S., Berrondo,M., Mentzer,S., Popović,Z.,
Havranek,J., Karanicolas,J., Das,R., Meiler,J., Kortemme,T., Gray,J.J., Kuhlman,B., Baker,D.
& Bradley,P. (2010). ROSETTA3.0: An Object-Oriented Software Suite for the Simulation
and Design of Macromolecules. Methods in Enzymology 487, 545-574.
7. Thompson,J. & Baker,D. (2011). Incorporation of evolutionary information into Rosetta
comparative modeling. Proteins 79 (8), 2380-2388.
8. Kamisetty,H., Ovchinnikov,S. & Baker,D. (2013). Assessing the utility of coevolution-based
residue–residue contact predictions in a sequence- and structure-rich era. PNAS 110 (39)
15674-15679.
9. Conway,P., Tyka,M.D., DiMaio,F., Konerding, D.E. & Baker,D. (2014). Relaxation of
backbone bond geometry improves protein energy landscape modeling. Protein Sci. 23 (1),
47-55.
10. Tyka,M.D., Keedy,D.A., André,I., Dimaio,F., Song,Y., Richardson,D.C., Richardson,J.S.,
and Baker,D. (2011). Alternate states of proteins revealed by detailed energy landscape
mapping. J. Mol. Biol. 405, 607–618.
11. Plaxco,K.W., Simons,K.T., Baker,D. (1998). Contact order, transition state placement and the
refolding rates of single domain proteins. J. Mol. Biol. 277 (4) 985–994.
12. Ray,A, Lindahl,E, Wallner,B. (2012). Improved model quality assessment using ProQ2.
BMC Bioinformatics 13, 224.
13. Uziela,K, Wallner,B. (2016). ProQ2: estimation of model accuracy implemented in
Rosetta. Bioinformatics. 32 (9), 1411-3.
14. McGuffin,L.J. (2008) The ModFOLD Server for the Quality Assessment of Protein
Structural Models. Bioinformatics 24, 586-587.
15. McGuffin,L.J. (2009) Prediction of global and local model quality in CASP8 using the
ModFOLD server. Proteins. 77, 185-190.
16. McGuffin,L.J., Buenavista,M.T., Roche,D.B. (2013) The ModFOLD4 Server for the
Quality Assessment of 3D Protein Models. Nucleic Acids Res., 41, W368-72.
17. Webb,B., Sali,A. (2014) Comparative Protein Structure Modeling Using Modeller.
Current Protocols in Bioinformatics. John Wiley & Sons, Inc., 5.6.1-5.6.32.

wfRosetta-Wallner

WeFold – the wfRosetta-Wallner branch

B. Wallner1, D. E. Kim2, and S. Crivelli3


1 - Division of Bioinformatics, Department of Physics, Chemistry and Biology, Linköping University, 581 83
Linköping, Sweden, 2 - Institute for Protein Design, University of Washington, Seattle, Washington, 98195, 3 –
Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley
bjornw@ifm.liu.se

WeFold is an open collaboration initiative for protein structure prediction within CASP. It brings
together labs and individuals through the science gateway http://wefold.nersc.gov/ and provides
computing and storage resources through the National Energy Research Scientific Computing
(NERSC) center. WeFold enables the interaction among groups that work on different
components of the protein structure prediction pipeline, thus making it possible to leverage
expertise at a scale that has not been done before. The combination of these components creates
hybrid protein structure prediction pipelines, each submitting its own models. This collaboration
aims to promote a synergistic effect among the participants and ultimately produce better results
than those achieved by the individual methods. In its third round, the collaboration resulted in 12
different pipelines. Here we describe the wfRosetta-Wallner pipeline, which applied a selection
process and domain assembly protocol to domain decoys generated by Robetta.

Methods

Decoy generation:
Decoy sets were generated from the fully automated structure prediction server, Robetta1 (see
BAKER-ROSETTASERVER abstract for details). Given the target sequence, Robetta first
predicts domain boundaries by identifying PDB templates with optimal sequence similarity and
structural coverage to the target through an iterative process. For each iteration, HHSearch2,
Sparks3, and RaptorX4 are used to identify templates and generate alignments. The target
sequence is threaded onto the template structures to generate partial-threaded models, which are
then clustered to identify distinct topologies that are ranked based on the likelihood of the
alignments. Regions of the target sequence that are not covered by the partial-threads or are not
similar in structure within the top ranked cluster are passed on to the next search iteration.
Through this iterative process, non-overlapping clusters are identified that, together, cover the
full length of the target sequence and domain boundaries are assigned at the transitions between
the clusters. The modeling difficulty of each domain is determined by the degree of structural
consensus between the top ranked partial threads from each alignment method. For each
predicted domain, Robetta uses the Rosetta comparative modeling protocol, RosettaCM5, which
recombines structural elements from the clustered partial-threads and models missing segments
using a combination of fragment insertion and mixed torsion-Cartesian space minimization.
Conformational sampling is performed using the Rosetta low-resolution score function6 with
spatial restraints that are generated separately from each cluster7. If enough co-evolutionary
sequence data exists to accurately predict residue-residue contacts using GREMLIN8, the
clusters are re-ranked using this information, and the spatial restraints are supplemented with the
predicted contacts. For difficult domains, models are also generated using the Rosetta fragment
assembly methodology6 (RosettaAbinitio), and if GREMLIN contacts are predicted, they are
used as restraints for sampling and refinement. All models are refined using a relax protocol9
that minimizes the Rosetta full-atom energy10 in torsion and Cartesian space to allow bond angle
flexibility. Up to four RosettaCM decoy sets were generated for each domain depending on the
number of template clusters and all RosettaCM models were provided. Two RosettaAbinitio
decoy sets were provided for each difficult domain: the top 5 percent all-atom Rosetta scoring
models and the top 15 percent contact order11 scoring models. Large scale sampling of around
10,000 to 300,000 models was possible through the use of the distributed computing project,
Rosetta@home.

Filtering:
The accuracy of all generated domain models was estimated using ProQ212, which was
recently implemented as a scoring function in Rosetta13, and the top 1,000 models for each
domain were selected. In total, 32,474,636 domain models were scored using ProQ2 in CASP12.

Final selection and domain assembly:


For each domain we selected the top 5 models as ranked by ProQ2 after removing any models
which were similar within 1 Å RMSD. In the case of single domain proteins, those 5 models
were submitted. For multi-domain proteins, the domains were assembled by inserting fragments
in the linker regions guided by inter-domain contact predictions from PconsC3
(http://pconsc3.bioinfo.se) using the remodel protocol in Rosetta.
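
The per-domain selection can be sketched as a greedy pass over the ProQ2 ranking. This is an illustrative reading of the step, not the pipeline's code: the function and parameter names are hypothetical, the RMSD routine is supplied by the caller, and the sketch assumes that a higher ProQ2 score is better and that a model is dropped when it lies within the cutoff of an already selected, higher-ranked model.

```python
def select_nonredundant(models, score, rmsd, n_select=5, cutoff=1.0):
    """Pick up to n_select top-scoring models that differ by more than cutoff.

    models: iterable of model identifiers.
    score:  function model -> quality score (higher is better; assumed).
    rmsd:   function (model_a, model_b) -> RMSD in Angstrom (caller-supplied).
    """
    selected = []
    for model in sorted(models, key=score, reverse=True):
        # Keep a model only if it is not within the cutoff of any kept model.
        if all(rmsd(model, kept) > cutoff for kept in selected):
            selected.append(model)
        if len(selected) == n_select:
            break
    return selected
```

With this ordering, the best-scoring representative of each group of near-identical decoys is the one retained.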

1. Kim,D.E., Chivian,D., & Baker,D. (2004). Protein structure prediction and analysis using the
Robetta server. Nucleic Acids Res 32, W526-W531.
2. Söding,J. (2005). Protein homology detection by HMM-HMM comparison. Bioinformatics
21 (7), 951-960.
3. Yang,Y., Faraggi,E., Zhao,H. & Zhou,Y. (2011). Improving protein fold recognition and
template-based modeling by employing probabilistic-based matching between predicted one-
dimensional structural properties of the query and corresponding native properties of
templates. Bioinformatics 27 (15), 2076-2082.
4. Peng,J. & Xu,J. (2011). Raptorx: Exploiting structure information for protein alignment by
statistical inference. Proteins 79, 161-171.
5. Song,Y., DiMaio,F., Wang,R., Kim,D., Miles,C., Brunette,T., Thompson,J., & Baker,D.
(2013). High-resolution comparative modeling with RosettaCM. Structure 21 (10), 1735-
1742.
6. Leaver-Fay,A., Tyka,M., Lewis,S., Lange,O.F., Thompson,J., Jacak,R., Kaufman,K.,
Renfrew,P.D., Smith,C., Sheffler,W., Davis,I., Cooper,S., Treuille,A., Mandell,D., Richter,F.,
Ban,Y.A., Fleishman,S., Corn,J., Kim,D.E., Lyskov,S., Berrondo,M., Mentzer,S., Popović,Z.,
Havranek,J., Karanicolas,J., Das,R., Meiler,J., Kortemme,T., Gray,J.J., Kuhlman,B., Baker,D.
& Bradley,P. (2010). ROSETTA3.0: An Object-Oriented Software Suite for the Simulation
and Design of Macromolecules. Methods in Enzymology 487, 545-574.
7. Thompson,J. & Baker,D. (2011). Incorporation of evolutionary information into Rosetta
comparative modeling. Proteins 79 (8), 2380-2388.
8. Kamisetty,H., Ovchinnikov,S. & Baker,D. (2013). Assessing the utility of coevolution-based
residue–residue contact predictions in a sequence- and structure-rich era. PNAS 110 (39)
15674-15679.
9. Conway,P., Tyka,M.D., DiMaio,F., Konerding, D.E. & Baker,D. (2014). Relaxation of
backbone bond geometry improves protein energy landscape modeling. Protein Sci. 23 (1),
47-55.
10. Tyka,M.D., Keedy,D.A., André,I., Dimaio,F., Song,Y., Richardson,D.C., Richardson,J.S., and
Baker,D. (2011). Alternate states of proteins revealed by detailed energy landscape mapping.
J. Mol. Biol. 405, 607–618.
11. Plaxco,K.W., Simons,K.T., Baker,D. (1998). Contact order, transition state placement and the
refolding rates of single domain proteins. J. Mol. Biol. 277 (4) 985–994.
12. Ray,A, Lindahl,E, Wallner,B. (2012). Improved model quality assessment using ProQ2.
BMC Bioinformatics 13, 224.
13. Uziela,K, Wallner,B. (2016). ProQ2: estimation of model accuracy implemented in Rosetta.
Bioinformatics. 32 (9), 1411-3.
14. Krivov,G.G. et al. (2009) Improved prediction of protein side-chain conformations with
SCWRL4. Proteins, 77, 778–795.

wfRosetta-PQ-MESHI-MSC

Selection of Rosetta decoys using ProQ2 and MESHI-MSC

S. Mirzaei1, T. Sidi2, C. Keasar2, D. E. Kim3, D. Baker3, B. Wallner4, and S.N. Crivelli5


1–California State Polytechnic University, Pomona; Industrial and Manufacturing Engineering Department, 2–
Department of Computer Science, Ben-Gurion University of the Negev, Israel, 3 - Dept. of Biochemistry, University
of Washington, USA, 4 - Division of Bioinformatics, Department of Physics, Chemistry and Biology, Linköping
University, Linköping, Sweden, 5–Lawrence Berkeley National Laboratory, USA
sncrivelli@lbl.gov

WeFold is an open collaboration initiative for protein structure prediction within CASP. It brings
together researchers through the science gateway http://wefold.nersc.gov/ and provides
computing and storage resources through the National Energy Research Scientific Computing
(NERSC) center. WeFold enables the interaction among groups that work on different
components of the protein structure prediction pipeline. The combination of these components
creates hybrid protein structure prediction pipelines, each submitting its own models. Here we
describe the wfRosetta-ProQ-MESHI-MSC pipeline, which applied a two-stage selection process
to domain decoys generated by Robetta.
Methods
Decoy generation: Decoy sets were generated from the Robetta server1 (see BAKER-
ROSETTASERVER abstract for details). Given the target sequence, Robetta first predicts
domain boundaries by identifying PDB templates with optimal sequence similarity and structural
coverage to the target through an iterative process. For each iteration, HHSearch2, Sparks3, and
RaptorX4 are used to identify templates and generate alignments. The target sequence is threaded
onto the template structures to generate partial-threaded models, which are then clustered to
identify distinct topologies that are ranked based on the likelihood of the alignments. Regions of
the target sequence that are not covered by the partial-threads or are not similar in structure
within the top ranked cluster are passed on to the next search iteration. Through this iterative
process, non-overlapping clusters are identified that, together, cover the full length of the target
sequence and domain boundaries are assigned at the transitions between the clusters. The
modeling difficulty of each domain is determined by the degree of structural consensus between
the top ranked partial threads from each alignment method. For each predicted domain, Robetta
uses the Rosetta comparative modeling protocol, RosettaCM5, which recombines structural
elements from the clustered partial-threads and models missing segments using a combination of
fragment insertion and mixed torsion-Cartesian space minimization. Conformational sampling is
performed using the Rosetta low-resolution score function6 with spatial restraints that are
generated separately from each cluster7. If enough co-evolutionary sequence data exists to
accurately predict residue-residue contacts using GREMLIN8, the clusters are re-ranked using
this information, and the spatial restraints are supplemented with the predicted contacts. For
difficult domains, models are also generated using the Rosetta fragment assembly methodology6
(RosettaAbinitio), and if GREMLIN contacts are predicted, they are used as restraints for
sampling and refinement. All models are refined using a relax protocol9 that minimizes the
Rosetta full-atom energy10 in torsion and Cartesian space to allow bond angle flexibility. Up to
four RosettaCM decoy sets were generated for each domain and all RosettaCM models were
provided. Two RosettaAbinitio decoy sets were provided for each difficult domain: the top 5%
all-atom Rosetta scoring models and the top 15% contact order11 scoring models. Large scale
sampling was possible through Rosetta@home.

Filtering: The accuracy of all generated domain models was estimated using ProQ212, which was
recently implemented as a scoring function in Rosetta13, and the top 1,000 models for each domain were
selected. In total, 32,474,636 domain models were scored using ProQ2 in CASP12.
Feature extraction: The top 1,000 comparative modeling models and the top 1,000 ab-initio models (the
latter, when available) of each predicted domain were downloaded from the WeFold server. They
were standardized in terms of MESHI features15,16 by SCWRL414 rotamer optimization followed by
MESHI energy minimization. A total of 106 structural features were extracted from each decoy and
uploaded to the WeFold server. These features are detailed in the MESHI_SERVER abstract.
Final selection: A scoring function based on Support Vector Machine (SVM) was used for final
selection16. Furthermore, a backward feature elimination was implemented. To this end, we
started with the set of MESHI features and studied the result by removing one feature at a time.
This elimination process was repeated until no further improvement was observed. For CASP12
the objective was loss minimization. We used SVM and 10-fold cross-validation to test the
result. A grid search was used to fine-tune the SVM parameters around the values
suggested in [16].
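
The backward feature elimination loop can be sketched generically. The code below is an assumption-laden illustration, not the MESHI-MSC implementation: `evaluate` is a hypothetical callback standing in for whatever returns the cross-validated loss of an SVM trained on a given feature subset, and the rule of dropping the single feature whose removal lowers the loss most is one plausible reading of "removing one feature at a time".

```python
def backward_elimination(features, evaluate):
    """Greedy backward feature elimination.

    features: list of feature names.
    evaluate: function feature_list -> loss (lower is better), for example
              a mean 10-fold cross-validation loss (hypothetical callback).
    Returns the retained features and their loss.
    """
    current = list(features)
    best_loss = evaluate(current)
    while len(current) > 1:
        # Try removing each remaining feature in turn.
        trials = [(evaluate([f for f in current if f != drop]), drop)
                  for drop in current]
        loss, drop = min(trials)
        if loss >= best_loss:
            break  # no single removal improves the loss any further
        best_loss = loss
        current.remove(drop)
    return current, best_loss
```

Each round costs one model evaluation per remaining feature, so with 106 features and 10-fold cross-validation the first round alone requires on the order of a thousand SVM fits.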
Availability
The MESHI software package (version 9.29, which was used in CASP12) is available at:
https://www.dropbox.com/sh/mb31bjdvvydhuzh/AADVcclTZKtFiSl6I9hBx8Dxa?dl=0

1. Kim,D.E., Chivian,D., & Baker,D. (2004). Protein structure prediction and analysis using the
Robetta server. Nucleic Acids Res 32, W526-W531.
2. Söding,J. (2005). Protein homology detection by HMM-HMM comparison. Bioinformatics 21
(7), 951-960.
3. Yang,Y., Faraggi,E., Zhao,H. & Zhou,Y. (2011). Improving protein fold recognition and
template-based modeling by employing probabilistic-based matching between predicted one-
dimensional structural properties of the query and corresponding native properties of
templates. Bioinformatics 27 (15), 2076-2082.
4. Peng,J. & Xu,J. (2011). Raptorx: Exploiting structure information for protein alignment by
statistical inference. Proteins 79, 161-171.
5. Song,Y. et al. (2013). High-resolution comparative modeling with RosettaCM. Structure 21
(10), 1735-1742.
6. Leaver-Fay,A. et al. (2010). ROSETTA3.0: An Object-Oriented Software Suite for the
Simulation and Design of Macromolecules. Methods in Enzymology 487, 545-574.
7. Thompson,J. & Baker,D. (2011). Incorporation of evolutionary information into Rosetta
comparative modeling. Proteins 79 (8), 2380-2388.
8. Kamisetty,H., Ovchinnikov,S. & Baker,D. (2013). Assessing the utility of coevolution-based
residue–residue contact predictions in a sequence- and structure-rich era. PNAS 110 (39)
15674-15679.
9. Conway,P. et al. (2014). Relaxation of backbone bond geometry improves protein energy
landscape modeling. Protein Sci. 23 (1), 47-55.
10. Tyka,M.D. et al. (2011). Alternate states of proteins revealed by detailed energy landscape
mapping. J. Mol. Biol. 405, 607–618.
11. Plaxco,K.W., Simons,K.T., Baker,D. (1998). Contact order, transition state placement and the
refolding rates of single domain proteins. J. Mol. Biol. 277 (4) 985–994.
12. Ray,A, Lindahl,E, Wallner,B. (2012). Improved model quality assessment using ProQ2.
BMC Bioinformatics 13, 224.
13. Uziela,K, Wallner,B. (2016). ProQ2: estimation of model accuracy implemented in Rosetta.
Bioinformatics. 32 (9), 1411-3.
14. Krivov,G. et al. (2009) Improved prediction of protein side-chain conformations with
SCWRL4. Proteins, 77, 778–795.
15. Kalisman,N. et al. (2005). MESHI: a new library of Java classes for molecular modeling.
Bioinformatics 21, 3931-3932.
16. Mirzaei,S. et al. (2016) Purely Structural Protein Scoring Functions Using Support Vector
Machine and Ensemble Learning. IEEE/ACM Transactions on Computational Biology and
Bioinformatics, in press.

wfRstta-PQ2-Seder

Eshel Faraggi 1, Andrzej Kloczkowski 2


1- IUPUI, 2- OSU, NCH

We have participated in the Critical Assessment of Protein Structure Prediction (CASP)
experiment with four prediction procedures. The procedure described in this abstract is labeled as
group "wfRstta-PQ2-Seder", number 067. This method is based on a new version of the Seder
program [1,2], with new and improved input features that will be described in an upcoming
manuscript. For this procedure, Seder was trained with soft and hard protein targets. We used
CASP5 through CASP10 server models as training data and CASP11 server models as a test
set. That is, in this case we trained over all CASP targets and optimized the prediction on both hard
and soft CASP11 targets. This version of Seder was then used to pick among all CASP12
submitted server models. In addition, we included models from the WeFold experiment [4] in the
pool of candidate protein models. To estimate the B-factors for the protein models, we used the
following equation: B-factor = 300 * SPXASA / (1 + model-residue-depth), where SPXASA is the
SPINE-X [3] predicted accessible surface area and model-residue-depth is the residue depth
reported by the program DEPTH [5]. This model came from approximately fitting a
distribution of experimental B-factors.
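
As a small worked illustration of the B-factor equation, the sketch below evaluates it for a single residue. The function and argument names are chosen for this example only, and the units and scaling of the two inputs are whatever SPINE-X and DEPTH report; they are not specified here.

```python
def estimated_b_factor(spx_asa, residue_depth):
    """B-factor = 300 * SPXASA / (1 + model-residue-depth).

    spx_asa:       SPINE-X predicted accessible surface area for the residue.
    residue_depth: residue depth of the same residue in the model (from DEPTH).
    """
    return 300.0 * spx_asa / (1.0 + residue_depth)
```

The form of the equation means that exposed residues (large predicted accessible surface area) and shallow residues (small depth) receive larger estimated B-factors, while buried, deep residues receive smaller ones.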

1. Faraggi, Eshel, and Andrzej Kloczkowski. "A global machine learning based scoring function
for protein structure prediction." Proteins: Structure, Function, and Bioinformatics 82.5
(2014): 752-759.
2. Manuscript in preparation.
3. Faraggi, Eshel, et al. "SPINE X: improving protein secondary structure prediction by multistep
learning coupled with prediction of solvent accessible surface area and backbone torsion
angles." Journal of computational chemistry 33.3 (2012): 259-267.
4. Khoury, George A., et al. "WeFold: a coopetition for protein structure prediction." Proteins:
Structure, Function, and Bioinformatics 82.9 (2014): 1850-1868.
5. Tan, Kuan Pern, Raghavan Varadarajan, and Mallur S. Madhusudhan. "DEPTH: a web server
to compute depth and predict small-molecule binding cavities in proteins." Nucleic acids
research 39.suppl 2 (2011): W242-W248.

Yang-Server

Residue-residue contact prediction by integrating template-based modeling and ab-initio
contact prediction methods

Qiqige Wuyun, Wei Zheng, Jianyi Yang


School of Mathematical Sciences, Nankai University, Tianjin, 300071, China
yangjy@nankai.edu.cn

Methods
The CASP11 experiment witnessed a significant improvement of residue-residue contact
prediction methods, which was achieved with the idea of coevolution (also called direct coupling
or evolutionary coupling) 1. In a recent large-scale comparative assessment of various contact
prediction methods, it was shown that the best-performing methods are the consensus-based ones
that combine coevolution methods with machine-learning-based methods 2. Inspired by this
result, we propose a pipeline to improve the accuracy further by combining existing contact
prediction methods with template-based modeling.
First, the I-TASSER Suite 3 is used for template identification and structure modeling. The
residue-residue contacts in the query are derived from the first I-TASSER model. The initial contact
probability (icp) for a pair of residues (i, j) is calculated as 1/[1 + e^(d-8)], where d is the distance
between the two residues. This probability may not be reliable for residues that are poorly
modeled, so the ResQ algorithm is used to predict the residue-specific quality of the I-TASSER
models 4. For a pair of residues, if at least one has low modeling quality (i.e., ResQ
score >1), the initial contact probability is down-scaled with the ResQ scores to
icp/[ResQ(i)+ResQ(j)], which is used as the final probability for template-based contact
prediction.
Second, as recommended in our assessment of contact prediction methods 2, a selected set of
ab-initio contact prediction methods is used to predict the residue contacts in the query, including
SVMSEQ 5, BETAcon 6, DNcon 7, CCMpred 8, MetaPSICOV 9 and PconsC2 10. The contacts
predicted by these methods are combined by support vector machines.
Third, the two sets of predicted contacts from the first and second steps are combined by
weighted linear combination. In the combination, the weights w1 for template-based prediction
for easy, medium and hard targets are 0.8, 0.5, and 0.1, respectively. For contacts predicted by
the ab-initio methods, the corresponding weight is 1 - w1. The contacts are sorted according to the
combined probability.
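
The first and third steps can be summarized in a few lines of code. This is a minimal sketch of the formulas as stated above, with function names invented for the example; the ab-initio probability is taken as an input because the SVM consensus of the second step is not reproduced here.

```python
import math

# Weights w1 for the template-based prediction, by target difficulty.
TEMPLATE_WEIGHT = {"easy": 0.8, "medium": 0.5, "hard": 0.1}

def initial_contact_probability(d):
    """icp = 1 / (1 + e^(d - 8)) for an inter-residue distance d."""
    return 1.0 / (1.0 + math.exp(d - 8.0))

def template_contact_probability(d, resq_i, resq_j):
    """Down-scale icp when at least one residue has a ResQ score above 1."""
    icp = initial_contact_probability(d)
    if resq_i > 1.0 or resq_j > 1.0:
        return icp / (resq_i + resq_j)
    return icp

def combined_probability(p_template, p_ab_initio, target_class):
    """Weighted linear combination with weights w1 and 1 - w1."""
    w1 = TEMPLATE_WEIGHT[target_class]
    return w1 * p_template + (1.0 - w1) * p_ab_initio
```

For example, a residue pair at distance 8 has an initial probability of 0.5; if one of the two residues has a ResQ score of 2.0 and the other 0.5, the template-based probability drops to 0.5/2.5 = 0.2.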

Results
We collected 23 CASP12 targets that have released PDB structures to evaluate the
performance of the proposed method. The average precisions for the top L/5 short-, medium- and
long-range contact predictions are 70.5%, 64.2%, and 61.9%, respectively, which are
consistently higher than the corresponding precision of a locally installed version of
MetaPSICOV (i.e., 65.8%, 57.3%, and 56.6%).
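
For readers unfamiliar with the metric, the precision of the top L/5 predictions in one sequence-separation range can be computed as sketched below. The function is illustrative only; the separation bounds are passed in by the caller because the range definitions are not given in this abstract (a commonly used convention, assumed here, is 6-11 residues for short, 12-23 for medium and 24 or more for long range).

```python
def top_l5_precision(predictions, native_contacts, length, min_sep, max_sep=None):
    """Precision of the top L/5 predicted contacts within a separation range.

    predictions:     iterable of (i, j, probability) tuples.
    native_contacts: set of (i, j) pairs with i < j observed in the structure.
    length:          sequence length L of the target.
    min_sep/max_sep: inclusive sequence-separation bounds (caller's convention).
    """
    in_range = [
        (prob, min(i, j), max(i, j))
        for i, j, prob in predictions
        if abs(i - j) >= min_sep and (max_sep is None or abs(i - j) <= max_sep)
    ]
    # Rank by predicted probability, highest first, and keep the top L/5.
    in_range.sort(key=lambda item: item[0], reverse=True)
    top = in_range[: max(1, length // 5)]
    if not top:
        return 0.0
    hits = sum((i, j) in native_contacts for _, i, j in top)
    return hits / len(top)
```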

1. Monastyrskyy B, D'Andrea D, Fidelis K, Tramontano A, Kryshtafovych A. New
encouraging developments in contact prediction: Assessment of the CASP11 results.
Proteins 2015.
2. Wuyun Q, Zheng W, Peng Z, Yang J. A large-scale comparative assessment of methods for
residue-residue contact prediction. Briefings in Bioinformatics 2016;submitted.
3. Yang J, Yan R, Roy A, Xu D, Poisson J, Zhang Y. The I-TASSER Suite: protein structure
and function prediction. Nature methods 2015;12(1):7-8.
4. Yang J, Wang Y, Zhang Y. ResQ: An Approach to Unified Estimation of B-Factor and
Residue-Specific Error in Protein Structure Prediction. J Mol Biol 2016;428(4):693-701.
5. Wu S, Zhang Y. A comprehensive assessment of sequence-based and template-based
methods for protein contact prediction. Bioinformatics 2008;24(7):924-931.
6. Cheng J, Baldi P. Three-stage prediction of protein β-sheets by neural networks,
alignments and graph algorithms. Bioinformatics 2005;21(suppl 1):i75-i84.
7. Eickholt J, Cheng J. Predicting protein residue-residue contacts using deep networks and
boosting. Bioinformatics 2012.
8. Seemayer S, Gruber M, Söding J. CCMpred—fast and precise prediction of protein
residue–residue contacts from correlated mutations. Bioinformatics 2014.
9. Jones DT, Singh T, Kosciolek T, Tetchner S. MetaPSICOV: Combining coevolution
methods for accurate prediction of contacts and long range hydrogen bonding in proteins.
Bioinformatics 2014.
10. Skwark MJ, Raimondi D, Michel M, Elofsson A. Improved Contact Predictions Using the
Recognition of Protein Like Contact Patterns. PLoS Comput Biol 2014;10(11):e1003889.

YASARA

The YASARA homology modeling module V4.0 with new profile and threading methods for
remote template recognition

E. Krieger
CMBI 260, NCMLS, Radboud University Nijmegen Medical Center, PO Box 9101, 6500 HB Nijmegen, the
Netherlands
elmar@yasara.org

In previous CASPs, the YASARA Structure server (www.yasara.org/homologymodeling)
submitted predictions only for classic homology modeling targets, where template identification
was easy, since high-resolution homology modeling including ligands is one of the main
applications of YASARA. For CASP12, remote fold recognition methods were developed, which
often helped to identify useful templates, but equally often failed. Because CASP ranking is usually
based on summed GDT_TS scores, the failures were submitted as well, although they would
not be used in practice, since they are automatically classified as trash in YASARA's homology
modeling report.

Methods
As in previous CASPs1, our method targets homology modeling with a focus on high-resolution
refinement; new folds can hardly be predicted. First, PSI-BLAST is run with UniRef90 profiles to
identify potential templates. If no reliable hit is found, the remote fold recognition procedure is
started, which scans a library of ~60000 representative PDB structures using a sensitive
profile-profile alignment with the Smith-Waterman algorithm. The match between PSI-Predicted2 and
actual secondary structure is included in the score, and the top hits are validated by building fast
approximate all-atom models to make sure that the aligned fragments pack together well.
High-resolution models are built for the top 10 templates, using stochastic3 profile-profile
alignments including SSALN features4 to arrive at up to five alternative high-scoring target-
template alignments, building models for all of them (using SCWRL5 rotamer libraries, but
additional energy terms), and scoring them. The best parts of the up to 50 models are fused to
form a hybrid model. The following special features were handled automatically: inclusion of
ligands in the model (as long as they interact well and stabilize the structure), automatic
oligomerization to capture stabilizing effects of quaternary structure, and pH-dependent hydrogen
bonding networks that include ligands to aid high-resolution refinement.
The best-scoring model was subjected to high-resolution refinement, running 100 short
MD simulations in parallel with the partly knowledge-based YASARA force field1. The best
refined model was submitted as model 1, the unrefined model as model 2, and models 3 to 5
were based on alternative alignments and/or alternative templates.
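
The local alignment at the core of the remote fold recognition step can be illustrated with a generic Smith-Waterman scorer. This is a textbook sketch, not YASARA's implementation: the linear gap penalty and the caller-supplied `pair_score` function are assumptions, with `pair_score` standing in for a profile-profile score that could also reward agreement between predicted and actual secondary structure.

```python
def smith_waterman_score(query, template, pair_score, gap_penalty=2.0):
    """Best local alignment score between two position sequences.

    query, template: sequences of positions (residues, profile columns, ...).
    pair_score:      function (query_pos, template_pos) -> float (assumed).
    gap_penalty:     linear penalty per gapped position (assumed).
    """
    cols = len(template) + 1
    prev = [0.0] * cols
    best = 0.0
    for i in range(1, len(query) + 1):
        cur = [0.0] * cols
        for j in range(1, cols):
            cur[j] = max(
                0.0,  # a local alignment may restart anywhere
                prev[j - 1] + pair_score(query[i - 1], template[j - 1]),
                prev[j] - gap_penalty,     # gap in the template
                cur[j - 1] - gap_penalty,  # gap in the query
            )
            best = max(best, cur[j])
        prev = cur
    return best
```

Because only two rows of the dynamic programming matrix are kept, memory use is linear in the template length, which matters when scanning a library of tens of thousands of structures.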

Results
The recipe above yielded homology models for essentially all CASP12 targets, many of which
were classified as trash, but submitted anyway for reasons explained above. The whole
procedure was implemented as a fully automatic server, requiring human intervention only for
occasional bug fixes.

Availability
The homology modeling module described here is available as part of YASARA Structure from
www.yasara.org

1. Krieger, E., Joo, K., Lee, J., Lee, J., Raman, S., Thompson, J., Tyka, M., Baker, D., Karplus,
K. (2009). Improving physical realism, stereochemistry, and side-chain accuracy in homology
modeling: Four approaches that performed well in CASP8. Proteins 77 Suppl 9, 114-122
2. Jones, D.T. (1999). Protein secondary structure prediction based on position-specific scoring
matrices. J.Mol.Biol. 292, 195-202
3. Mueckstein, U., Hofacker, I.L. and Stadler, P.F. (2002). Stochastic pairwise alignments.
Bioinformatics 18 Sup2, 153-160
4. Qiu, J. and Elber, R. (2006). SSALN: An alignment algorithm using structure-dependent
substitution matrices and gap penalties learned from structurally aligned protein pairs.
Proteins 62, 881-891
5. Canutescu, A.A., Shelenkov, A.A. and Dunbrack, R.L. Jr. (2003). A graph-theory algorithm for
rapid protein side-chain prediction.
Protein Sci. 12, 2001-2014.

Zhang

Protein structure predictions by Zhang human group in CASP12

Chengxin Zhang1, Baoji He1,2, Yang Zhang1


1Department of Computational Medicine and Bioinformatics, Department of Biological Chemistry, University of
Michigan, 100 Washtenaw Ave, Ann Arbor, MI 48109;2Institute of Theoretical Physics, The Chinese Academy of
Sciences, Beijing 100080
yangzhanglab@umich.edu

Methods
The structure prediction of the Zhang human group in CASP12 is based on the I-TASSER
pipeline1-3 that is identical to that used in the Zhang-Server group, except that the whole set of
structure models generated by the CASP servers, instead of the LOMETS templates, were used
as the starting models of the I-TASSER pipeline. One purpose of our participation in the human
section is to examine the impact of different initial threading templates on the final models of the
I-TASSER pipeline.
Similar to the Zhang-Server pipeline, for the targets that were deemed by LOMETS4 as Hard
and Very-Hard targets, the models generated by QUARK5 and NNB-QUARK (He et al, in
preparation) are used to sort and re-rank the CASP server models based on their TM-score to the
ab initio folding models, under the hypothesis that a match of models from template-based
modeling and ab initio folding is significant and often indicates the correctness of the folds. The
consensus restraints are then extracted from both ab initio structural models and the CASP server
models, which are used to guide the I-TASSER structure assembly simulations. In addition, the
sequence-based contact maps, generated by the recently developed NN-BAYES program,6 are
incorporated into the I-TASSER force field. The final models were selected by SPICKER7 from
the simulation trajectories and further refined at the atomic level by fragment-guided
molecular dynamics (FG-MD) simulations.8
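
The re-ranking of server models against the ab initio models can be sketched as follows. The code is an illustration under stated assumptions: the TM-score routine is supplied by the caller, and taking the maximum TM-score over the ab initio models as the agreement measure is a choice made for this example, since the aggregation rule is not specified above.

```python
def rerank_by_ab_initio_agreement(server_models, ab_initio_models, tm_score):
    """Sort server models by their best TM-score to any ab initio model.

    server_models, ab_initio_models: lists of model identifiers.
    tm_score: function (model_a, model_b) -> TM-score (caller-supplied).
    """
    def agreement(model):
        # Assumed aggregation: best match to any ab initio folding model.
        return max(tm_score(model, ref) for ref in ab_initio_models)

    return sorted(server_models, key=agreement, reverse=True)
```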
For multiple-domain proteins, ThreaDom9 was used to predict the domain boundary and
linker regions from the LOMETS threading alignments. Full-length models are assembled from
the domain structures built individually by I-TASSER, using the full-chain model as the
reference of the domain orientations. The procedure is fully automated.

Availability
The on-line I-TASSER server and the standalone package are available at
http://zhanglab.ccmb.med.umich.edu/I-TASSER.

1. Roy, A.; Kucukural, A.; Zhang, Y., I-TASSER: a unified platform for automated protein
structure and function prediction. Nat Protoc 2010, 5, 725-38.
2. Wu, S.; Skolnick, J.; Zhang, Y., Ab initio modeling of small proteins by iterative TASSER
simulations. BMC Biol 2007, 5, 17.
3. Yang, J.; Yan, R.; Roy, A.; Xu, D.; Poisson, J.; Zhang, Y., The I-TASSER Suite: protein
structure and function prediction. Nature Methods 2015, 12, 7-8.
4. Wu, S.; Zhang, Y., LOMETS: A local meta-threading-server for protein structure prediction.
Nucl. Acids. Res. 2007, 35, 3375-3382.

5. Xu, D.; Zhang, Y., Ab initio protein structure assembly using continuous structure fragments
and optimized knowledge-based force field. Proteins 2012, 80, 1715-35.
6. He, B.; Mortuza, G.; Wang, Y.; Zhang, Y., NN-BAYES: Predicting protein contact maps using
neural network training coupled with naive Bayes classifiers 2016, submitted.
7. Zhang, Y.; Skolnick, J., SPICKER: A clustering approach to identify near-native protein
folds. J Comput Chem 2004, 25, 865-71.
8. Zhang, J.; Liang, Y.; Zhang, Y., Atomic-level protein structure refinement using fragment-
guided molecular dynamics conformation sampling. Structure 2011, 19, 1784-95.
9. Xue, Z.; Xu, D.; Wang, Y.; Zhang, Y., ThreaDom: extracting protein domain boundary
information from multiple threading alignments. Bioinformatics 2013, 29, i247-i256.

Zhang_Contact

NN-BAYES: Combining Neural Network and Naïve Bayes Classifier for Sequence-Based
Protein Contact prediction

Baoji He1,2, Chengxin Zhang2, Yanting Wang1, Yang Zhang2


1-Institute of Theoretical Physics, The Chinese Academy of Sciences, Beijing 100080;2-Department of
Computational Medicine and Bioinformatics, Department of Biological Chemistry, University of Michigan, 100
Washtenaw Ave, Ann Arbor, MI 48109
yangzhanglab@umich.edu

We developed a method, NN-BAYES, to generate ab initio contact predictions from query
sequences. NN-BAYES was tested in CASP12 as “Zhang_Contact” in the automated Server
section. In NN-BAYES, a naïve Bayes classifier (NBC) is used to combine the contact
predictions from multiple contact prediction programs. The posterior probability of the NBC is
then used, together with a set of intrinsic sequence-based features, to train a neural network
contact predictor.

Methods
To train the NN-BAYES predictor, we collected 517 non-redundant proteins from the PDB,
which have a pair-wise sequence identity <25%, length between 50 and 300 AA, and resolution
better than 3 Å. Based on the experimental structures, residue pairs are classified as “contact”
and “non-contact”. A sliding window of 11 residues is used to enhance the stability of
feature selection. Two categories of features are designed to train NN-BAYES.
Sequence based features. Six types of features are extracted from the query sequence: (1) A
residence feature (=0 or 1) that labels whether a window position goes beyond the ends of the
target sequence. Given the window size, this results in 22 (=11*2) features for each pair of
residues. (2)
Confidence score of secondary structure prediction for helix, beta and coil (3*22=66 features),
predicted by PSSpred1. (3) Confidence score of solvent accessibility (22 features), predicted by
MUSTER2. (4) Shannon entropy (22 features), calculated from the multiple sequence alignment
matrix from PSI-BLAST search3. (5) Sequence separation for long-range contacts and the log
length of protein (2 features). (6) Composition of residues from PSI-BLAST PSSM matrix
(22*21=462 features).
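
As a rough illustration of how the per-pair window features might be laid out, the sketch below builds the 22 window positions for a residue pair, the residence flags, and a per-column Shannon entropy. The function names are hypothetical. Centering the window on each residue, using 1 to mark positions outside the sequence, and using a base-2 logarithm are all assumptions not stated in the abstract.

```python
import math
from collections import Counter


def window_positions(i, j, half_width=5):
    """Positions of the two 11-residue windows, assumed centered on i and j."""
    offsets = range(-half_width, half_width + 1)
    return [i + k for k in offsets] + [j + k for k in offsets]


def residence_flags(i, j, length, half_width=5):
    """1 if a window position lies outside the sequence, else 0 (assumed encoding)."""
    return [0 if 0 <= p < length else 1
            for p in window_positions(i, j, half_width)]


def shannon_entropy(column):
    """Shannon entropy (in bits, an assumption) of one alignment column."""
    counts = Counter(column)
    total = len(column)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

For a pair near the N-terminus, such as residues 2 and 50 of a 100-residue protein, three of the 22 flags are set because three window positions precede the first residue.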
Naïve Bayes classifier predictions. Starting from the query sequence, eight individual
programs are used to generate residue-residue contact predictions, which include three machine-
learning methods (BETACON4, SVMCON5 and SVMSEQ6), three coevolution methods
(mfDCA7, PSICOV8 and CCMpred9), and two meta-server methods (STRUCTCH10 and
MetaPSICOV11). The posterior probability of the residue contacts is then derived by the naïve
Bayes classifier from the conditional probabilities of the eight methods and is used as an
additional training feature of NN-BAYES. For a pair of residues, the NBC contributes 11*11=121
features.
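
A minimal sketch of the naïve Bayes combination is given below, assuming that each predictor's score has already been converted into a class-conditional likelihood (for example by binning scores on the training set). The function names, the binning, and the interpretation of the 121 features as posteriors over an 11×11 grid of neighboring residue pairs are assumptions for illustration only.

```python
import math


def naive_bayes_posterior(lik_contact, lik_noncontact, prior_contact):
    """Posterior P(contact | scores) under conditional independence.

    lik_contact[k]    = P(score of predictor k | contact)
    lik_noncontact[k] = P(score of predictor k | non-contact)
    """
    log_c = math.log(prior_contact) + sum(math.log(p) for p in lik_contact)
    log_n = math.log(1.0 - prior_contact) + sum(math.log(p) for p in lik_noncontact)
    # Normalize in log space to avoid underflow with many predictors.
    shift = max(log_c, log_n)
    num = math.exp(log_c - shift)
    return num / (num + math.exp(log_n - shift))


def nbc_window_pairs(i, j, half_width=5):
    """The 11*11 neighboring pairs around (i, j); an assumed feature layout."""
    offsets = range(-half_width, half_width + 1)
    return [(i + a, j + b) for a in offsets for b in offsets]
```

When all eight predictors are uninformative (equal likelihoods under both classes), the posterior reduces to the prior, which is a quick sanity check on the combination rule.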
Overall, there are 717 features (=[22+66+22+22+2+462]+121) collected in NN-BAYES. The
neural network training was performed by the Weka data mining package12, where 150 hidden
units and one output unit were selected. To enhance the specificity, short-, medium- and long-range
contacts are trained separately. For the short- and medium-range contacts, the training
dataset contains all the residue pairs from the 517 training proteins. For the long-range ones,
however, there are more than 5 million residue pairs, which exceeds the storage capacity of our
current computing resources. We therefore constructed a long-range training set that contains the
contact samples from all the training proteins but draws the non-contact samples from 1 million
long-range residue pairs randomly selected from the 517 training proteins; the ratio of contact to
non-contact pairs is 2:23 in the final long-range sample. We tried increasing the size of the
training data but found no obvious change in the results, indicating that 1 million pairs are
sufficient to achieve a stable training result.
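
The feature bookkeeping and the down-sampling of non-contact pairs can be checked with a short sketch. The dictionary keys and function names are hypothetical labels. The contact count computed here is only what the stated 2:23 ratio would imply if the non-contact sample contained exactly 1 million pairs; the abstract does not report that number.

```python
import random

# Feature counts as listed in the Methods text.
SEQUENCE_FEATURES = {
    "residence": 22,
    "secondary_structure": 3 * 22,
    "solvent_accessibility": 22,
    "shannon_entropy": 22,
    "separation_and_log_length": 2,
    "pssm_composition": 22 * 21,
}
NBC_FEATURES = 11 * 11


def total_features():
    """Total input dimension: sequence-based features plus NBC features."""
    return sum(SEQUENCE_FEATURES.values()) + NBC_FEATURES


def implied_contact_count(n_noncontact, ratio=(2, 23)):
    """Number of contacts implied by a contact:non-contact ratio (derived, not reported)."""
    contact_part, noncontact_part = ratio
    return round(n_noncontact * contact_part / noncontact_part)


def subsample(pairs, k, seed=0):
    """Randomly keep k pairs; the fixed seed is an assumption for reproducibility."""
    pairs = list(pairs)
    if k >= len(pairs):
        return pairs
    return random.Random(seed).sample(pairs, k)
```

Summing the six sequence-based groups gives 596 features, and adding the 121 NBC features reproduces the total of 717.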

Availability
NN-BAYES server is freely available at: http://zhanglab.ccmb.med.umich.edu/NN-BAYES.

1. Yan, R.; Xu, D.; Yang, J.; Walker, S.; Zhang, Y., A comparative assessment and analysis of
20 representative sequence alignment methods for protein structure prediction. Sci Rep 2013,
3, 2619.
2. Wu, S.; Zhang, Y., MUSTER: Improving protein sequence profile-profile alignments by
using multiple sources of structure information. Proteins 2008, 72, 547-56.
3. Altschul, S. F.; Madden, T. L.; Schaffer, A. A.; Zhang, J.; Zhang, Z.; Miller, W.; Lipman, D.
J., Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.
Nucleic acids research 1997, 25, 3389-402.
4. Cheng, J.; Baldi, P., Three-stage prediction of protein β-sheets by neural networks,
alignments and graph algorithms. Bioinformatics 2005, 21, i75-i84.
5. Cheng, J.; Baldi, P., Improved residue contact prediction using support vector machines and
a large feature set. BMC bioinformatics 2007, 8, 1.
6. Wu, S.; Zhang, Y., A comprehensive assessment of sequence-based and template-based
methods for protein contact prediction. Bioinformatics 2008, 24, 924-931.
7. Kaján, L.; Hopf, T. A.; Kalaš, M.; Marks, D. S.; Rost, B., FreeContact: fast and free software
for protein contact prediction from residue co-evolution. BMC bioinformatics 2014, 15, 1.
8. Jones, D. T.; Buchan, D. W.; Cozzetto, D.; Pontil, M., PSICOV: precise structural contact
prediction using sparse inverse covariance estimation on large multiple sequence alignments.
Bioinformatics 2012, 28, 184-190.
9. Seemayer, S.; Gruber, M.; Söding, J., CCMpred—fast and precise prediction of protein
residue–residue contacts from correlated mutations. Bioinformatics 2014, 30, 3128-3130.
10. Sun, H. P.; Huang, Y.; Wang, X. F.; Zhang, Y.; Shen, H. B., Improving accuracy of protein
contact prediction using balanced network deconvolution. Proteins 2015, 83, 485-496.
11. Jones, D. T.; Singh, T.; Kosciolek, T.; Tetchner, S., MetaPSICOV: combining coevolution
methods for accurate prediction of contacts and long range hydrogen bonding in proteins.
Bioinformatics 2015, 31, 999-1006.
12. Hall, M.; Frank, E.; Holmes, G.; Pfahringer, B.; Reutemann, P.; Witten, I. H., The WEKA
data mining software: an update. SIGKDD Explor. Newsl. 2009, 11, 10-18.

Zhang-Refinement

Atomic-level Protein Structure Refinement by Homology Template Guided Molecular
Dynamics Simulations

Wei Zheng1,2, Chengxin Zhang1, Yang Zhang1


1School of Mathematical Sciences and LPMC, Nankai University, Tianjin, China, 2Department of Computational
Medicine and Bioinformatics, Department of Biological Chemistry, University of Michigan, 100 Washtenaw Ave,
Ann Arbor, MI 48109
yangzhanglab@umich.edu

We participated in the CASP12 structure refinement experiment as the human group
“Zhang-Refinement”, with a protein structure refinement pipeline extended from
Fragment-Guided Molecular Dynamics simulation (FG-MD1). The new pipeline includes four
steps:
(1) Collection of distance restraints from homology templates. From the initial model
released by CASP, a collection of homology templates with a TM-score >0.5 is identified by
TM-align2 from a non-redundant PDB database. The top 10 templates ranked by TM-score3, 4
will be selected. For each selected template, a Cα distance map will be collected from the
TM-align aligned regions. Compared to the distance map of the initial model, we will remove all
distance restraints that have $|d_{i,j}^{\mathrm{template}} - d_{i,j}^{\mathrm{initial}}| > 8$ Å from the template-based distance maps.
The remaining distance map from the top 10 templates will be used in Step 3 to guide the
molecular dynamics (MD) simulations.
(2) Release of Cα clashes. In this step, we check whether the initial model contains any Cα
clashes (i.e. Cα-Cα distance <3.6 Å). If so, a fast (1 ns) free MD simulation, based on the
LAMMPS5 simulated annealing package with the AMBER99⁶ force field, will be performed to remove
the steric clashes.
(3) Distance map guided MD refinement simulations. Ten independent atomic-level
molecular dynamics simulations (10×20 ns) based on LAMMPS will be performed to refine the
models returned from Step 2. The simulations are guided by a combination of the
AMBER99 force field and the distance map term collected in Step 1. For each MD simulation
run, the distance map from one template is utilized. Each MD simulation run creates 20,000
structural decoys.
(4) Model selection. Two models are initially selected from each MD simulation: one is the
average of the coordinates of the last ten structural decoys and the other is the center of
the largest cluster of the 20,000 decoys identified by SPICKER7; this results in 20 refined
models. The final model, which has the best combined score of the hydrogen-bonding score
(HB-score), MolProbity score (MP-score)8 and clash number, is selected for submission. Here,
SCWRL4.0⁹ is used to repack the side chains of the initial model, and the repacked initial model
is then used as the reference conformation to calculate the HB-score, i.e.
HB-score=#(consensus H-bonds)/#(H-bonds in reference).
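
Three of the quantities used in the steps above can be sketched with NumPy: the restraint filter of Step 1, the Cα clash check of Step 2, and the HB-score of Step 4. All function names are hypothetical. The sketch assumes that the template distance map has already been mapped onto target residue numbering, that `aligned_mask` marks pairs in which both residues are aligned, and that "consensus H-bonds" means hydrogen bonds present in both the model and the reference. How the three selection scores are combined is not specified in the abstract and is not attempted here.

```python
import numpy as np


def ca_distance_map(coords):
    """Pairwise C-alpha distances for an (L, 3) coordinate array."""
    diff = coords[:, None, :] - coords[None, :, :]
    return np.linalg.norm(diff, axis=-1)


def filter_restraints(template_map, initial_map, aligned_mask, max_dev=8.0):
    """Boolean mask of residue pairs (i < j) kept as distance restraints.

    Pairs whose template distance deviates from the initial model by more
    than max_dev (8 Angstrom in the text) are discarded.
    """
    keep = aligned_mask & (np.abs(template_map - initial_map) <= max_dev)
    return np.triu(keep, k=1)


def has_ca_clash(coords, cutoff=3.6):
    """True if any pair of C-alpha atoms is closer than the cutoff."""
    dist = ca_distance_map(coords)
    upper = np.triu_indices(len(coords), k=1)
    return bool((dist[upper] < cutoff).any())


def hb_score(model_hbonds, reference_hbonds):
    """Fraction of reference hydrogen bonds that are also found in the model."""
    reference = set(reference_hbonds)
    if not reference:
        return 0.0
    return len(set(model_hbonds) & reference) / len(reference)
```

Hydrogen bonds are represented here as hashable identifiers such as (donor residue, acceptor residue) tuples; any consistent identifier would work.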

Availability: http://zhanglab.ccmb.med.umich.edu/FG-MD/

1. Zhang, J.; Liang, Y.; Zhang, Y., Atomic-Level Protein Structure Refinement Using
Fragment-Guided Molecular Dynamics Conformation Sampling. Structure 2011, 19, 1784-
1795.
2. Zhang, Y.; Skolnick, J., TM-align: a protein structure alignment algorithm based on the TM-
score. Nucleic Acids Research 2005, 33, 2302-2309.
3. Zhang, Y.; Skolnick, J., Scoring function for automated assessment of protein structure
template quality. Proteins: Structure, Function, and Bioinformatics 2007, 68, 1020-1020.
4. Xu, J.; Zhang, Y., How significant is a protein structure similarity with TM-score = 0.5?
Bioinformatics 2010, 26, 889-895.
5. Plimpton, S., Fast Parallel Algorithms for Short-Range Molecular Dynamics. Journal of
Computational Physics 1995, 117, 1-19.
6. Wang, J.; Cieplak, P.; Kollman, P. A., How well does a restrained electrostatic potential
(RESP) model perform in calculating conformational energies of organic and biological
molecules? Journal of Computational Chemistry 2000, 21, 1049-1074.
7. Zhang, Y.; Skolnick, J., SPICKER: A clustering approach to identify near-native protein
folds. Journal of Computational Chemistry 2004, 25, 865-871.
8. Davis, I. W.; Murray, L. W.; Richardson, J. S.; Richardson, D. C., MOLPROBITY: structure
validation and all-atom contact analysis for nucleic acids and their complexes. Nucleic acids
research 2004, 32, W615-9.
9. Krivov, G. G.; Shapovalov, M. V.; Dunbrack, R. L., Improved prediction of protein side-
chain conformations with SCWRL4. Proteins: Structure, Function, and Bioinformatics 2009,
77, 778-795.

Zhang-Server

Protein structure predictions by I-TASSER in CASP12

Chengxin Zhang1, Baoji He1,2, Yang Zhang1


1-Department of Computational Medicine and Bioinformatics, Department of Biological Chemistry, University of
Michigan, 100 Washtenaw Ave, Ann Arbor, MI 48109; 2-Institute of Theoretical Physics, The Chinese Academy of
Sciences, Beijing 100080
yangzhanglab@umich.edu

Methods

The Zhang-Server structure prediction in CASP12 is built on the I-TASSER pipeline.1-3 The
target sequence is threaded by LOMETS4 through the PDB library to identify putative structure
templates. Continuously aligned fragments are excised from the threading templates, which are
used to assemble full-length models by iterative replica-exchange Monte Carlo (REMC)
simulations as described previously.1, 3, 5 The simulation decoys are clustered by SPICKER,6 with
the selected models further refined at the atomic level by fragment-guided molecular dynamics
(FG-MD7) simulations.
In addition to the classic I-TASSER pipeline, several approaches were recently developed
and integrated into I-TASSER to enhance its ability of structure modeling for distant-homology
targets. First, the top models generated by the QUARK ab initio folding8 were merged into the
threading template pool, which were used as the starting conformations of I-TASSER
simulations. For the Very Hard targets, where there is almost no correlation between the
template quality and the threading alignment score, the threading templates were sorted and
re-ranked based on their distance to the ab initio models, with the aim of picking up templates
with the correct fold. Second, since the hard targets generally lack global templates, the sequences were broken
into segments of 2-4 consecutive secondary structure elements which were then threaded through
the PDB by the segmental threading tool SEGMER9 to identify super-secondary structure motifs;
these motif alignments are used to provide medium-range distance restraints to I-TASSER.
Third, the recently developed NN-BAYES method10 is used to generate residue contact maps that
are used to guide the structure assembly simulations. These additional strategies are mainly
applied to the targets that are deemed by LOMETS4 as Hard or Very Hard targets, whereas the
first approach taking QUARK fragments was only applied on the alpha- or alpha/beta-proteins
since the current ab initio folding approach does not work well for folding beta-proteins.
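
The segmentation used for the second strategy can be illustrated with a short sketch that cuts a predicted secondary structure string into segments of 2-4 consecutive secondary structure elements. This is not SEGMER's implementation. The function names are hypothetical, and the sketch assumes a three-state H/E/C alphabet, that coil runs are not counted as elements, and that all overlapping segments are enumerated.

```python
from itertools import groupby


def secondary_structure_elements(ss):
    """Return (type, start, end) for each helix (H) or strand (E) run; end is exclusive."""
    elements = []
    pos = 0
    for state, run in groupby(ss):
        run_length = len(list(run))
        if state in "HE":
            elements.append((state, pos, pos + run_length))
        pos += run_length
    return elements


def sse_segments(ss, min_elems=2, max_elems=4):
    """Enumerate (start, end) spans covering 2-4 consecutive elements."""
    elems = secondary_structure_elements(ss)
    segments = []
    for size in range(min_elems, max_elems + 1):
        for k in range(len(elems) - size + 1):
            # Span from the start of the first element to the end of the last.
            segments.append((elems[k][1], elems[k + size - 1][2]))
    return segments
```

Each returned span could then be cut from the query sequence and threaded separately.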
For multiple-domain proteins, ThreaDom11 was used to predict the domain boundary and
linker regions from the LOMETS meta-server threading alignments. Full-length models are
assembled from the domain structures built individually by I-TASSER, using the full-chain I-
TASSER model as the reference of the domain orientations.

Availability
The on-line I-TASSER server and the standalone package are available at
http://zhanglab.ccmb.med.umich.edu/I-TASSER.

1. Wu, S.; Skolnick, J.; Zhang, Y., Ab initio modeling of small proteins by iterative TASSER
simulations. BMC Biol 2007, 5, 17.
2. Zhang, Y., Template-based modeling and free modeling by I-TASSER in CASP7. Proteins
2007, 69, 108-117.
3. Roy, A.; Kucukural, A.; Zhang, Y., I-TASSER: a unified platform for automated protein
structure and function prediction. Nat Protoc 2010, 5, 725-38.
4. Wu, S.; Zhang, Y., LOMETS: A local meta-threading-server for protein structure prediction.
Nucl. Acids. Res. 2007, 35, 3375-3382.
5. Yang, J.; Yan, R.; Roy, A.; Xu, D.; Poisson, J.; Zhang, Y., The I-TASSER Suite: protein
structure and function prediction. Nature Methods 2015, 12, 7-8.
6. Zhang, Y.; Skolnick, J., SPICKER: A clustering approach to identify near-native protein
folds. J Comput Chem 2004, 25, 865-71.
7. Zhang, J.; Liang, Y.; Zhang, Y., Atomic-level protein structure refinement using fragment-
guided molecular dynamics conformation sampling. Structure 2011, 19, 1784-95.
8. Xu, D.; Zhang, Y., Ab initio protein structure assembly using continuous structure fragments
and optimized knowledge-based force field. Proteins 2012, 80, 1715-35.
9. Wu, S.; Zhang, Y., Recognizing protein substructure similarity using segmental threading.
Structure 2010, 18, 858-67.
10. He, B.; Mortuza, G.; Wang, Y.; Zhang, Y., NN-BAYES: Predicting protein contact maps using
neural network training coupled with naive Bayes classifiers 2016, submitted.
11. Xue, Z.; Xu, D.; Wang, Y.; Zhang, Y., ThreaDom: extracting protein domain boundary
information from multiple threading alignments. Bioinformatics 2013, 29, i247-i256.

CAPRI: Bates_BMM

Descending the protein binding funnel – Docking, scoring and refinement of protein
complexes in CASP-CAPRI 2016

E. Pfeiffenberger and P.A. Bates


Biomolecular Modelling Laboratory, The Francis Crick Institute, 1 Midland Road, London NW1 1AT, UK
paul.bates@crick.ac.uk
The docking and optimization of protein complexes remain challenging. Complex formation
often involves conformational transitions from the unbound to the bound state, which are still
not completely covered by current docking methods1. Here we present our integrated approach of
docking, scoring and refinement applied to targets in the CASP-CAPRI round in 2016.

Methods
Our three-step protocol can be described as follows:

1) Docking.
For all predicted homo-oligomeric structures we used a modification to our binary protein-
docking algorithm SwarmDock2. Our method uses the principles of particle swarm optimization
to search the parameter docking space. The innovation with the new algorithm is to treat each
particle within the swarm as an instance of a packed homo-oligomer, constrained by the
appropriate symmetry operators. The objective is to optimize the particle space in order to find
the most energetically favorable homo-oligomer. Particles move through a multi-parameter space
by the optimization of two sets of parameters: orientations and translations of the monomeric
units relative to the imposed symmetry and a linear combination of normal modes that adjust the
conformation of each monomer, in the presence of the other monomers, in this simultaneous
docking process. For hetero-oligomeric structures we employed our standard SwarmDock
protocol. The monomeric models used for docking were selected from the CASP12 server
tarballs.

2) Scoring
The docked solutions were clustered with a 10 Å cut-off and ranked following the cluster ranking
protocol described in Ref. 3. Essentially, the method employs 109 molecular descriptors important
for protein-protein interactions (e.g. residue-contact potentials, atomic-contact potentials,
solvation energy functions, etc.) that are integrated in an extremely randomized trees classifier.
The protocol first locally enriches clusters with additional poses generated by
SwarmDock. The clusters are then characterized using features describing the distribution of
molecular descriptors within the cluster, which are combined into a pairwise cluster comparison
model to discriminate near-native from incorrect clusters. Following this cluster ranking,
individual models from the best ranking clusters are selected with ZRANK4. The same ranking
protocol was applied to the CAPRI scoring rounds.

3) Refinement
For the best ranked clusters, physics-only refinement was applied to the model with the best
ZRANK score in each cluster in order to further improve the quality of the protein
complex. Essentially, the method models the induced-fit mechanism and is based on
metadynamics, where our collective variable (CV) defines the contact map space (CMS), to
generate models with improved ligand and interface conformations and a higher fraction of
native contacts.
The method makes use of the observed interface residue-residue contacts of the models in
a cluster. The residue-residue contacts between a receptor and a ligand are identified with a Cα or
Cβ distance below 8 Å. From these contacts a contact map (CM) is generated, namely CMif , that
contains the list of unique contacts with the lowest Cα/Cβ distance. This CMif defines our CV
describing the CMS:
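$$CV_1(R) = \frac{1}{N} \sum_{\gamma \in CM_{if}} \left( D_\gamma(R) - D_\gamma(R_{ref}) \right)^2 \qquad (1)$$
$$D_\gamma(R) = \frac{1 - (r_\gamma / r_\gamma^0)^n}{1 - (r_\gamma / r_\gamma^0)^m} \qquad (2)$$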
The sigmoidal distance function $D_\gamma(R)$ is used to quantify the formation of a contact γ in
structure R, where $r_\gamma$ is the contact distance in structure R and $r_\gamma^0$ is the contact
distance in the reference structure $R_{ref}$, which denotes one of the models from the cluster
where the contact was observed. The exponents n and m are constants, set to n=6 and m=10.
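
Equations (1) and (2) can be evaluated numerically as in the sketch below, which is a plain NumPy illustration and not the PLUMED implementation used for the simulations. The function names are hypothetical. The sketch assumes that N is the number of contacts in the contact map and replaces the 0/0 form at $r_\gamma = r_\gamma^0$ with its analytical limit n/m, so that $D_\gamma(R_{ref}) = n/m$.

```python
import numpy as np


def switching(r, r0, n=6, m=10):
    """Rational switching function D = (1 - x**n) / (1 - x**m) with x = r / r0."""
    x = np.atleast_1d(np.asarray(r, dtype=float) / np.asarray(r0, dtype=float))
    out = np.full(x.shape, n / m)  # analytical limit as x -> 1
    regular = ~np.isclose(x, 1.0)
    out[regular] = (1.0 - x[regular] ** n) / (1.0 - x[regular] ** m)
    return out


def contact_map_cv(r, r0, n=6, m=10):
    """CV1: mean squared deviation of D from its value in the reference structure."""
    d = switching(r, r0, n, m)
    d_ref = switching(r0, r0, n, m)  # equals n / m for every contact
    return float(np.mean((d - d_ref) ** 2))
```

With this definition the CV is zero when every interface contact distance matches its reference value and grows as contacts shorten or break.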
The sampling with metadynamics in CMS is performed at 300 K for 10 ns with 5 replicas,
resulting in 50 ns of sampling data for each target. The sampling of the CMS was performed with
the GROMACS5 plug-in PLUMED6, where a Gaussian is deposited every 2 ps with
σ=0.5, a bias factor of 10 and an initial height of 5 kJ/mol. Snapshots from the trajectories are
taken every 10 ps, resulting in 4810 frames in total. Scoring of these frames is based on
reconstructing the free energy surface (FES) by integrating the bias deposited during the
simulation and on computing the ZRANK score. Furthermore, we constructed the combined
scoring function CSα = (1-α)FESN + αZRANKN, which uses the normalized energies from both
FES and ZRANK to score frames, with α=0.5 so that both terms contribute equally to the
scoring.
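
The combined score can be sketched as follows. The abstract states only that normalized energies are used, so the min-max normalization below is an assumption, as are the function names and the convention that lower values are better for both terms.

```python
import numpy as np


def min_max(values):
    """Scale values to [0, 1]; the normalization scheme is an assumption."""
    v = np.asarray(values, dtype=float)
    span = v.max() - v.min()
    if span == 0.0:
        return np.zeros_like(v)
    return (v - v.min()) / span


def combined_score(fes, zrank, alpha=0.5):
    """CS_alpha = (1 - alpha) * FES_N + alpha * ZRANK_N for each frame."""
    return (1.0 - alpha) * min_max(fes) + alpha * min_max(zrank)


def best_frames(fes, zrank, k=20, alpha=0.5):
    """Indices of the k frames with the lowest combined score."""
    return np.argsort(combined_score(fes, zrank, alpha))[:k]
```

Setting alpha to 0 or 1 recovers ranking by the FES term or the ZRANK term alone.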
Overall, 2 models were generated for each cluster, each based on the
average structure of the 20 best scoring frames according to ZRANK or CSα (i.e. ZRANK20 and
CSα20). Each averaged model was subjected to a two-step steepest descent minimization procedure:
a minimization of the structure in vacuum followed by a minimization
with explicit solvent, with 50000 steps in each minimization.

Availability
The SwarmDock webserver is available at https://bmm.crick.ac.uk/~SwarmDock/ . The manuscript
for our cluster ranking protocol has been submitted (see Ref. 3). The protein-protein refinement
manuscript is in preparation.

1. Kuroda, D., & Gray, J. J. (2016). Pushing the backbone in protein-protein docking. Structure.
2. Torchala M., Moal I.H., Chaleil R.A.G, Fernandez-Recio, J. & Bates P.A.(2013).
SwarmDock: a server for flexible protein-protein docking. Bioinformatics. 29(6), 807-9.
3. Pfeiffenberger, E., Chaleil, R. A. G., Moal, I. & Bates, P. A. (2016). A machine learning
approach for ranking clusters of docked protein-protein complexes by pairwise cluster
comparison. Proteins: Structure, Function, and Bioinformatics, Manuscript submitted.

4. Pierce, B., & Weng, Z. (2007). ZRANK: reranking protein docking predictions with an
optimized energy function. Proteins: Structure, Function, and Bioinformatics, 67(4), 1078-
1086.
5. Pronk, S., Pall, S., et al. (2013). GROMACS 4.5: a high-throughput and highly parallel open
source molecular simulation toolkit. Bioinformatics
6. Tribello, G. A., Bonomi, M., et al. (2014). PLUMED 2: New feathers for an old bird.
Computer Physics Communications, 185(2), 604-613.

CAPRI: ClusPro

Performance of ClusPro server in 2016 CASP/CAPRI rounds

Dima Kozakov1, Sandor Vajda2,3, Christine Yueh, Kathryn Porter, Dmitri Beglov2,
Dzmitry Padhorny1, and Arina Nikitina4
1-Laufer Center for Physical and Quantitative Biology, Stony Brook University; Departments of 2-Biomedical
Engineering and 3-Chemistry, Boston University; 4-Moscow Institute of Physics and Technology

The ClusPro server performs rigid body docking using the PIPER program and clusters the 1000
lowest energy structures. The models are ranked according to cluster size. In order to deliver
results to the user within 24 hours of submission, the current implementation of ClusPro does not
include refinement beyond minimizing the energy of structures to remove steric overlaps. In
spite of this limitation, the server has almost 7800 registered users and has run about 150,000
jobs in the last 3 years.
In the latest rounds of the CASP-CAPRI experiment we have applied ClusPro to predicting the
probable biological units of the “easy” CASP targets, as well as to the prediction of several
hetero-complex structures. Based on the provided evaluation, we submitted acceptable or better
server predictions for 7 of the 10 targets, all within the top 6 models. The 7 correctly predicted
targets included two hetero-dimers and five homo-oligomers. In addition, we have recently
developed a docking based method to discriminate between biological and crystallographic
dimers. The method has been implemented as a new option in ClusPro, and we present our
analysis of the dimer targets included in the latest CASP/CAPRI rounds using this approach.
Methods
Model preparation. Since ClusPro requires three-dimensional structures as the input, we have
built a “consensus” model for each target using the 150 server models provided by the CASP
management committee. For each “easy” target most models had the same fold, with variations
in loops and tails. Removal of the uncertain regions resulted in reliable “consensus” models that
were used for docking.
Docking. Our docking approach consists of two steps. The first step is running PIPER, a docking
program that performs systematic search of complex conformations on a grid using the fast
Fourier transform (FFT) correlation approach. The scoring function includes van der Waals
interaction energy, an electrostatic energy term, and desolvation contributions calculated by a
pairwise potential. The second step of the algorithm is clustering the top 1000 structures
generated by PIPER using pairwise RMSD as the distance measure. The radius used in clustering
is defined in terms of Cα interface RMSD. For each docked conformation we select the residues
of the ligand that have any atom within 10 Å of any receptor atom, and calculate the Cα RMSD
for these residues from the same residues in all other 999 ligands. Thus, clustering 1000 docked
conformations involves computing a 1000 × 1000 matrix of pairwise Cα RMSD values. Based on
the number of structures that a ligand has within a (default) cluster radius of 9 Å RMSD, we
select the largest cluster and rank its cluster center as number one. The members of this cluster
are removed from the matrix, and we select the next largest cluster and rank its center as number
two, and so on. After clustering with this hierarchical approach, the ranked complexes are
subjected to a straightforward (300 step and fixed backbone) van der Waals minimization using
the CHARMM potential to remove potential side chain clashes. ClusPro outputs the centers of
the 10 largest clusters, which were submitted as predictions.
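
The clustering step can be sketched as a greedy procedure over a precomputed RMSD matrix. This is an illustration and not the ClusPro code. The function name is hypothetical, treating "within" the radius as less than or equal is an assumption, and ties are broken by the lowest index.

```python
import numpy as np


def greedy_cluster(rmsd, radius=9.0, max_clusters=10):
    """Repeatedly pick the structure with the most remaining neighbors.

    rmsd is an (N, N) matrix of pairwise interface C-alpha RMSD values.
    Returns a list of (center_index, member_indices) in rank order.
    """
    rmsd = np.asarray(rmsd, dtype=float)
    active = np.ones(len(rmsd), dtype=bool)
    clusters = []
    while active.any() and len(clusters) < max_clusters:
        # Neighbor relation restricted to structures not yet assigned.
        neighbors = (rmsd <= radius) & active[None, :] & active[:, None]
        counts = neighbors.sum(axis=1)
        center = int(np.argmax(counts))
        members = np.flatnonzero(neighbors[center])
        clusters.append((center, members.tolist()))
        active[members] = False  # remove the cluster from the matrix
    return clusters
```

The cluster centers, in the order returned, correspond to the ranked models.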

CAPRI: Fernandez-Recio

Combination of Template-Based, Ab Initio Docking and Scoring with pyDock for the
Modeling of Protein Assemblies in the CASP12-CAPRI Challenge

Chiara Pallara, Miguel Romero, Mireia Rosell, Lucía Díaz, Brian Jiménez-García and
Juan Fernández-Recio
Barcelona Supercomputing Center (BSC)
juanf@bsc.es

The combination of the CASP and CAPRI experiments links two methodological areas,
protein-protein docking and protein structure modeling, that have traditionally been separate and
individually assessed. This has encouraged many protein docking groups to extend their approaches
and develop new tools to tackle some of the challenges related to both areas, such as the structural
modeling of protein assemblies. The joint challenge between CASP12 and CAPRI Round 37
consisted of the structural modeling and interface residue prediction of a total of 11 oligomers,
including 3 homo-trimers, 2 hetero-dimers, 4 homo-dimers, 1 dimer of hetero-dimers and 1
homo-octamer. We have participated in all proposed targets, both as predictors and as scorers,
with a global 60% success for the evaluated targets.

Methods

Modeling of monomeric subunits


We used the CASP-hosted server predictions as starting points for the structural modeling of all
the individual subunits, although specific strategies were applied for the different targets. For
T110 and T111, we manually chose as input for docking only one of the 20 models provided at
the first-stage model release. For the homo-oligomeric targets (T112, T114-T116, T118-T119),
we automatically selected a set of structures formed by the top 3 CASP models from the
ZHANG, ROSETTA, QUARK and SEOK servers (according to their order of submission) within
the second-stage models release. For the heterodimers (T113 and T120) we used for each subunit
only the top 1 model from each of these servers. For the dimer of hetero-dimers (T117) we used
the top 3 models from these servers for the LGN protein, and all 150 server models provided at
the second-stage model release for the inscuteable protein.

Generation of rigid-body docking poses for the predictors experiment


For all the targets, most of the models were generated by ab initio docking using different
protocols depending on the oligomeric state of the target. For the homo-dimer and homo-trimer
targets (T110-T112, T114, T116 and T119), except for T115 (suspected of asymmetric
dimerization), MZDock1 was used to generate 1500 rigid-body docking poses from each input
structure, with dimeric or trimeric docking depending on the case. For T115 and the hetero-
dimeric targets (T113 and T120), FTDock2 (with electrostatics and 0.7 Å grid resolution) and
ZDOCK 2.1³ were used to generate 10,000 and 2000 rigid-body docking poses for each starting
monomer structure, respectively.
For targets where homologous complex structures were found (T110-T112, T118-T120),
some of the submitted models were built based on such homologous templates. Template
structures were manually selected according to the sequence identity and the phylogenetic and
structural similarity to the target. For targets T110 and T111, the trimeric structure was built by
applying a standard homology modeling protocol using Modeller 9v6.⁴ For targets T112, T116,
and T118-T120, the final complex assembly was built by superposition on the template structure
of the most suitable monomer model(s) according to energetic or steric features.
Models for targets T117-T118 had some of their interfaces built by template-based
modeling and the other ones by ab initio docking.

Scoring of rigid-body docking poses for both the predictors and the scorers experiment
We scored the docking models generated by the above-described ab initio docking procedure
with our default pyDock protocol,5 according to the total binding energy of all possible
interfaces. For some of the targets (T110-T112) we found experimental information on possible
interface residues, which was used in a final distance-based filtering step. Finally, we eliminated
the redundant predictions and minimized the final ten selected docking models.
In the scorers experiment, we first eliminated all the docking models with a percentage of
secondary structure significantly lower than the one observed in the corresponding set of
structures previously selected as predictors. Then, the same protocol used in predictors was
applied to score the docking models, including a final distance-based filtering step only for T111-
T112. We did not build any template-based model for the scorers experiment.

Predictions of interface residues


For all the targets of the CASP12-CAPRI experiment, participants were also asked to predict the
interface residues. For this, we used our pyDock module pyDockNIP,6 based on the residues
found at the protein-protein interfaces of the lowest-energy docking poses. For targets T113-
T115, the interface predictions were based on the ab initio docking results, while for the rest of
the targets, the predictions were based on a combination of ab initio docking and template-based
models (the higher the sequence identity between the target and the template, the higher the
contribution of the template-based models).

Results
We submitted correct models for 6 of the 10 evaluated targets, both as predictors and as scorers.
The six successful cases as predictors were those for which the majority of the participant groups
also submitted correct models. These were easy cases with a template complex structure
available. In general, we also submitted correct models as scorers for those cases that were easy
for all participants, except for T119 (almost acceptable model). Interestingly, in targets T117 and
T120 we were one of the few participants (two and five successful groups, respectively) that
submitted correct models as scorers.

Availability
The pyDock5 program is available for academic use as a GNU/Linux binary and as a web server
(https://life.bsc.es/pid/pydock/).

1. Pierce, B., Tong, W. & Weng, Z. M-ZDOCK: a grid-based approach for Cn symmetric
multimer docking. Bioinformatics 21, 1472–1478 (2005).
2. Gabb, H. A., Jackson, R. M. & Sternberg, M. J. Modelling protein docking using shape
complementarity, electrostatics and biochemical information. J. Mol. Biol. 272, 106–120
(1997).
3. Chen, R., Li, L. & Weng, Z. ZDOCK: an initial-stage protein-docking algorithm. Proteins 52,
80–87 (2003).
4. Sali, A. & Blundell, T. L. Comparative protein modelling by satisfaction of spatial restraints.
J. Mol. Biol. 234, 779–815 (1993).
5. Cheng, T. M.-K., Blundell, T. L. & Fernandez-Recio, J. pyDock: electrostatics and desolvation
for effective scoring of rigid-body protein-protein docking. Proteins 68, 503–515 (2007).
6. Grosdidier, S. & Fernández-Recio, J. Identification of hot-spot residues in protein-protein
interactions by computational docking. BMC Bioinformatics 9, 447 (2008).

CAPRI: HADDOCK

HADDOCK’s performance in the second joint CASP-CAPRI round

J. Schaarschmidt1, J.P.G.L.M. Rodrigues2, P.I. Koukos1, Z. Kurkcuoglu1, M. Trellet1, L.C. Xue1,
C. Geng1, I. Moreira1, G.C.P. van Zundert1, A. Vangone1, A.S.J. Melquiond1, and
A.M.J.J. Bonvin1
1- Computational Structural Biology group - Bijvoet Center for Biomolecular Research, Faculty of Science, Utrecht
University, 2- Computational Structural Biology, Stanford University,
a.m.j.j.bonvin@uu.nl

We participated with HADDOCK in the second joint CASP-CAPRI round (CAPRI round 37)
under the server category (performing template identification, sequence alignment, homology
modelling, and docking within the server submission deadline) and as scorers. The models for
the various targets were generated using a combination of homology modeling and (refinement-)
docking with the HADDOCK web server 1,2. For scoring we followed our standard and simple
energy-based protocols3.

Methods
For most of our predictions, models were built with MODELLER (v9.16) 4 using HHpred5 for
the template search. For the two targets without suitable templates we used models provided by
the I-TASSER web server6 for T113 and a selection of models from the CASP predictions for
T117. The target/template alignment was optimized, if necessary, using manually-curated
multiple sequence alignments or CLUSTAL7 profile alignments of multiple sequences and
structural sequence alignments generated by MUSTANG8. Ten models with the highest DOPE
score9 were selected as starting points for ensemble docking in HADDOCK. Except for Targets
113 and 117, structures were directly modeled as multimers in MODELLER and only subjected
to the explicit solvent refinement in HADDOCK. For homo-oligomeric structures, symmetry
restraints (C2, C3 or D2 depending on the target, together with non-crystallographic symmetry
restraints) were used during setup of the docking run. For Target 113 we used the functional
region prediction of FRpred10 as restraints to drive the docking with HADDOCK. For Target 117,
Cα-Cα distance restraints of a partial complex11 in combination with CM and symmetry restraints
were used to guide the docking. A summary of the modelling methodology and restraints used
for each target is provided in Table 1 below. The models for submission were distributed over the
clusters ranked by HADDOCK with the top cluster overrepresented.

Results
In summary HADDOCK was successful in six out of ten targets, submitting two models with
“high” and four models with “acceptable” quality according to CAPRI classification criteria. As
for the remaining four targets, no successful predictions were submitted by any group for Target
114 and Target 116 and only one acceptable prediction was submitted for one of the five
interfaces of Target 117. For Target 113 docking failed due to the large structural difference of
the employed models to the real structure of the second interactor.

CAPRI target  CASP target  Alignment                Template(s)                   Modelling  Docking        Restraints                Classification of best model
T110          T0860        HHpred                   3zpe                          Modeller   Refinement     C3+NCS symmetry           acceptable
T111          T0867        HHpred                   4d0u                          Modeller   Refinement     C3+NCS symmetry           high
T112          T0881        HHpred+manually curated  2vtw, 2ium                    Modeller   Refinement     C3+NCS symmetry           acceptable
                           (structural+profile alignment)
T113          T0884+T0885  -                        -                             I-Tasser   Full protocol  FRpred                    -
T114          T0875        HHpred                   4dpz                          Modeller   Refinement     C2+NCS symmetry           -
T115*         T0877        HHpred                   5en2                          Modeller   Refinement     C2+NCS symmetry           cancelled
T116          T0893        HHpred                   4u7o, 3d36, 4jav, 4biu        Modeller   Refinement                               -
T117          T0903+T0904  -                        -                             CASP       Full protocol  CM+CA-CA+C2+NCS symmetry  -
T118          T0906        HHpred                   3t2b                          Modeller   Refinement     C2+NCS symmetry           high
T119          T0917        HHpred                   3bfj, 3ox4, 5br4, 1vlj, 3iv7  Modeller   Refinement     C2+NCS symmetry           acceptable
T120          T0921+T0922  HHpred+manually curated  4uyp and 4dh2                 Modeller   Refinement                               acceptable

* Target was cancelled.

Availability
The HADDOCK web server is freely available to academic users after registration at
http://haddock.science.uu.nl/.

1. van Zundert, G. C. P., Rodrigues, J. P. G. L. M., Trellet, M., Schmitz, C., Kastritis, P. L.,
Karaca, E., Melquiond, A. S. J., van Dijk, M., de Vries, S. J., Bonvin, A. M. J. J. (2015).
The HADDOCK2.2 Web Server: User-Friendly Integrative Modeling of Biomolecular
Complexes. J Mol Biol. 428, 720–725.
2. de Vries, S. J., van Dijk, M., Bonvin, A. M. J. J. (2010). The HADDOCK web server for
data-driven biomolecular docking. Nat Protoc. 5, 883–897.
3. Vangone, A., Rodrigues, J. P. G. L. M., Xue, L. C., van Zundert, G. C. P., Geng, C.,
Kurkcuoglu, Z., Nellen, M., Narasimhan, S., Karaca, E., van Dijk, M., Melquiond, A. S.
J., Visscher, K., Trellet, M., Kastritis, P. L., Bonvin, A. M. J. J. Sense and Simplicity in
HADDOCK Scoring: Lessons from CASP-CAPRI Round 1. Proteins.
4. Sali, A., Blundell, T. L. (1993). Comparative protein modelling by satisfaction of spatial
restraints. J Mol Biol. 234, 779–815.
5. Hildebrand, A., Remmert, M., Biegert, A., Söding, J. (2009). Fast and accurate automatic
structure prediction with HHpred. Proteins. 77 Suppl 9, 128–132.
6. Yang, J., Zhang, Y. (2015). I-TASSER server: new development for protein structure and
function predictions. Nucleic Acids Res. 43, W174–W181.
7. Chenna, R., Sugawara, H., Koike, T., Lopez, R., Gibson, T. J., Higgins, D. G., Thompson,
J. D. (2003). Multiple sequence alignment with the Clustal series of programs. Nucleic
Acids Res. 31, 3497–3500.
8. Konagurthu, A. S., Whisstock, J. C., Stuckey, P. J., Lesk, A. M. (2006). MUSTANG: a
multiple structural alignment algorithm. Proteins. 64, 559–574.
9. Shen, M.-Y., Sali, A. (2006). Statistical potential for assessment and prediction of protein
structures. Protein Sci. 15, 2507–2524.
10. Alva, V., Nam, S.-Z., Söding, J., Lupas, A. N. (2016). The MPI bioinformatics Toolkit as
an integrative platform for advanced protein sequence and structure analysis. Nucleic
Acids Res. 44, W410–W415.
11. Culurgioni, S., Alfieri, A., Pendolino, V., Laddomada, F., Mapelli, M. (2011). Inscuteable
and NuMA proteins bind competitively to Leu-Gly-Asn repeat-enriched protein (LGN)
during asymmetric cell divisions. Proc Natl Acad Sci USA. 108, 20998–21003.

CAPRI: Grudinin

Using machine-learning, symmetry information and exhaustive space exploration for
template-free modeling in CAPRI Round 37

S. Grudinin1,2,3 , G. Pages1,2,3, G. Derevyanko4,5, D.W. Ritchie6


1 Univ. Grenoble Alpes, LJK, 38000 France, 2 CNRS, LJK, 38000 France , 3 Inria Grenoble, France , 4 IBS
Grenoble, France, 5 FZJ Juelich, Germany, 6 Inria Nancy, France
sergei.grudinin@inria.fr

In the CASP12-CAPRI experiment we systematically improved our symmetric docking
pipeline with respect to the previous CASP11-CAPRI experiment [1]. In the first step, we
systematically docked all 150 stage-2 predictions of the CASP servers. For each of these, we
exhaustively sampled the rigid configurational space in 4 degrees of freedom for cyclic
symmetries and in 6 degrees of freedom for other complexes. For the sampling, we used the
SAM symmetry assembler [2], which is built on top of the Hex docking engine [3]. In the next
step, we re-scored the top 60 × 150 solutions using a first-order minimization, side-chain
reconstruction and the knowledge-based KSENIA potential [4,5]. Then, we clustered the
obtained solutions with an RMSD threshold of 5 Å. Finally, we removed from the prediction list
complexes with low folding scores, which we obtained using a fold quality assessment potential
trained with deep learning. More precisely, we constructed a deep convolutional
network consisting of 23 layers with 3D volumetric grids of protein structures at the initial layer.
We trained it on 1,250 protein folds and their misfolded structures (decoys), maximizing the
correlation between the scores and the RMSD of the decoys. As a result, in the CASP12-CAPRI
experiment we could find at least one interface in 7 out of 10 targets, which is clear progress
with respect to the previous experiment.
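
The clustering step with a 5 Å RMSD threshold could look roughly like the sketch below. The abstract does not say which clustering algorithm was used, so the greedy "leader" scheme here is an assumption, as are the function names, the convention that a lower score is better, and the simplification that solutions are already in a common frame (no superposition is performed).

```python
import math

def rmsd(a, b):
    """RMSD between two equally sized lists of (x, y, z) points, no superposition."""
    sq = sum((p - q) ** 2 for pa, pb in zip(a, b) for p, q in zip(pa, pb))
    return math.sqrt(sq / len(a))

def greedy_cluster(solutions, threshold=5.0):
    """Greedy leader clustering (assumed scheme, not taken from the abstract).

    solutions: list of (score, coords); a lower score is assumed to be better.
    Returns a list of clusters, each a list of indices into `solutions`
    whose first element is the best-scoring member (the representative).
    """
    order = sorted(range(len(solutions)), key=lambda i: solutions[i][0])
    clusters = []
    for i in order:
        for cluster in clusters:
            rep = cluster[0]
            if rmsd(solutions[i][1], solutions[rep][1]) <= threshold:
                cluster.append(i)
                break
        else:
            # No existing representative is close enough: start a new cluster.
            clusters.append([i])
    return clusters

# Toy data: two-point "structures" with made-up scores.
solutions = [
    (-10.0, [(0, 0, 0), (1, 0, 0)]),
    (-9.0, [(0, 1, 0), (1, 1, 0)]),    # 1 Å from the first solution
    (-8.0, [(20, 0, 0), (21, 0, 0)]),  # 20 Å from the first solution
]
clusters = greedy_cluster(solutions)
```

Here the first two solutions fall into one cluster and the third starts its own.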

[1] Marc F. Lensink et al (2016). Prediction of homo- and hetero-protein complexes by
protein docking and template-based modeling: a CASP-CAPRI experiment. Proteins.
[2] David W. Ritchie & Sergei Grudinin (2016). "Spherical polar Fourier assembly of protein
complexes with arbitrary point group symmetry". J. Appl. Cryst. 49 (1): 158-167.
[3] D.W. Ritchie & G.J.L. Kemp (2000). Protein Docking Using Spherical Polar Fourier
Correlations, PROTEINS: Struct. Funct. Genet. 39, 178-194.
[4] Petr Popov & Sergei Grudinin (2015). "Knowledge of Native Protein–Protein Interfaces Is
Sufficient To Construct Predictive Models for the Selection of Binding Candidates". J.
Chem. Inf. Mod. 55 (10): 2242-2255.
[5] E. Neveu, D. Ritchie, P. Popov, & S. Grudinin (2016). "PEPSI-Dock: a detailed data-
driven protein–protein interaction potential accelerated by polar Fourier correlation".
Bioinformatics. 32 (17): i693-i701.

CAPRI: InterPred

InterPred: A Pipeline To Identify and Model Protein-Protein Interactions

Claudio Mirabello, Björn Wallner


Linköping University, S-581 83 Linköping, Sweden
bjornw@ifm.liu.se

We present InterPred1, a computational pipeline for predicting protein-protein interactions using
structural modelling combined with massive structural comparisons and molecular docking. A
key component of the method is the use of a novel random forest classifier to integrate several
structural features to distinguish correct from incorrect protein interaction models.
In short, the pipeline consists of three steps. In the first step, the 3D models for the target
sequences submitted by the server groups participating in CASP undergo model quality
assessment. In the second step, the PDB is scanned for structural templates for the interaction.
Suitable templates are then used to build a coarse interaction model for the target structures.
Coarse interaction models are then ranked by our random forest classifier so that the most
promising ones are used as a starting point in the third step of our pipeline, where a full-atom
refinement of the coarse models is performed. The best refined models are then selected and
submitted as final models.

Methods
The server models of the target sequences were downloaded from the CASP webpage and
evaluated by our MQAP method, Pcomb (see CASP12 abstract, Wallner group, for details). The
top ranked models were then used in the structural template search. If no suitable structural
templates are found in the next step of the pipeline, lower ranking models may be considered
until at least one model can be built.
The structural template search was performed by running structural alignments between each
target monomer and the full PDB (May 2016 version) with the TM-align2 program. Whenever
two PDB chains with same PDB id (i.e. from the same complex) are found that are a good match
with the targets (TM-score > 0.5) a coarse interaction model is built by superimposing the two
targets to the template, thereby transferring the positional relationship of the template to the
targets. A set of features is extracted to represent the model and these are evaluated by our
random forest classifier. When more than one model is available, a ranking is performed based
on the confidence measure of their quality. If the target complex is composed of more than two
monomers, a partial coarse model is built with the first two target monomers, then a new
structural template search is run by using the partial coarse model and the third monomer as
inputs, and so on.
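
The template search logic described above (two chains from the same PDB entry, each matching one target monomer with TM-score > 0.5) can be sketched as follows. This is an illustration only: the function name, the tuple-based input format and the PDB identifiers are made up, and ranking by the weaker of the two TM-scores is a stand-in for the random forest classifier, which is not reproduced here.

```python
from collections import defaultdict
from itertools import product

def find_interaction_templates(hits_a, hits_b, cutoff=0.5):
    """hits_a, hits_b: lists of (pdb_id, chain_id, tm_score) structural hits
    for target monomers A and B (hypothetical format, e.g. parsed from
    TM-align output). Returns (pdb_id, chain_for_a, chain_for_b, min_tm)
    tuples, best first."""
    def by_pdb(hits):
        good = defaultdict(list)
        for pdb_id, chain, tm in hits:
            if tm > cutoff:
                good[pdb_id].append((chain, tm))
        return good

    good_a, good_b = by_pdb(hits_a), by_pdb(hits_b)
    templates = []
    # Keep only PDB entries that contain a good match for both targets.
    for pdb_id in good_a.keys() & good_b.keys():
        for (ca, tma), (cb, tmb) in product(good_a[pdb_id], good_b[pdb_id]):
            if ca != cb:  # require two distinct chains of the same complex
                templates.append((pdb_id, ca, cb, min(tma, tmb)))
    # Stand-in ranking: weaker of the two TM-scores, highest first.
    return sorted(templates, key=lambda t: t[3], reverse=True)

# Placeholder identifiers and scores, for illustration only.
hits_a = [("1abc", "A", 0.8), ("2xyz", "A", 0.6), ("3def", "A", 0.4)]
hits_b = [("1abc", "B", 0.7), ("2xyz", "A", 0.9), ("3def", "B", 0.9)]
templates = find_interaction_templates(hits_a, hits_b)
```

Only the first entry survives: in the second both targets match the same chain, and in the third one match falls below the cutoff.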
Finally, refinement is performed for up to 10 coarse models with RosettaDock3. A thousand
decoys are generated using each coarse model as a starting point with -dock_pert 5 12
parameters. The decoys are ranked by their IRMSD with the starting points, and the top 10 are
selected for submission to CAPRI.

Availability
InterPred is available as a stand-alone download free for academic use at:
http://wallnerlab.org/InterPred

1. Mirabello, C. & Wallner, B. InterPred: A pipeline to identify and model protein-protein
interactions. Submitted.
2. Zhang, Y. & Skolnick, J. TM-align: a protein structure alignment algorithm based on the TM-
score. Nucleic acids research, 33(7):2302–2309, 2005.
3. Gray, J., Moughon, S., Wang, C., Schueler-Furman, O., Kuhlman, B., Rohl, C.A. & Baker, D.
Protein-protein docking with simultaneous optimization of rigid-body displacement and side-
chain conformations. Journal of Molecular Biology, 331(1):281–299, August 2003.

CAPRI: Oliva

Performance of a Pure Consensus Approach for the Scoring of Docking Decoys in CASP12-
CAPRI37

E. Chermak1, L. Cavallo1 and R. Oliva2


1 - Kaust Catalysis Center, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia; 2 -
Department of Sciences and Technologies, University “Parthenope” of Naples, Napoli.
romina.oliva@uniparthenope.it

Algorithms for scoring protein-protein docking decoys generally rely on physics-based and/or
knowledge-based terms. We introduced in the field the first pure consensus method,
CONSRANK (CONSensus RANKing).1 CONSRANK, also available as a web server,2 ranks
models based on their ability to match the most conserved (or frequent) inter-residue contacts in
the ensemble they belong to.
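
A minimal sketch of this kind of consensus ranking is given below, assuming each model is reduced to a set of inter-residue contacts. It is not the CONSRANK code: the function name, the input format and the choice of averaging the contact frequencies over the contacts of each model are assumptions made for illustration.

```python
from collections import Counter

def consensus_rank(models):
    """models: dict mapping a model name to its set of inter-residue contacts,
    each contact a (residue_in_partner_1, residue_in_partner_2) tuple.
    A model's score is the mean ensemble frequency of its contacts."""
    n = len(models)
    freq = Counter(c for contacts in models.values() for c in contacts)
    scores = {}
    for name, contacts in models.items():
        if contacts:
            scores[name] = sum(freq[c] / n for c in contacts) / len(contacts)
        else:
            scores[name] = 0.0
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy ensemble of four docking models.
models = {
    "m1": {("A10", "B5"), ("A11", "B6")},
    "m2": {("A10", "B5"), ("A11", "B6"), ("A12", "B7")},
    "m3": {("A10", "B5"), ("A30", "B40")},
    "m4": {("A50", "B60")},
}
ranking = consensus_rank(models)
```

Model m1 ranks first because both of its contacts recur across the ensemble, while m4, whose single contact appears nowhere else, ranks last. No energy term enters the score.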
Blind testing in recent CAPRI Rounds, including the CASP11-CAPRI30 joint
experiment, showed CONSRANK to perform on par with the state-of-the-art energy- and
knowledge-based scoring functions, at least for targets with well-defined interaction interfaces
and docking ensembles sufficiently enriched in correct solutions. In CASP11-CAPRI30, we
could indeed identify native-like solutions for 14 targets, for 11 of them of high or medium
quality (in comparison, the first-ranked scorer group listed 18 successful targets, 14 of which
with high/medium quality solutions).3 Moreover, we achieved the best results overall for 6
targets. CONSRANK performed equally well in the latest CAPRI Rounds 31-35, being the first-
ranked scorer in Round 34.
More recently, we implemented a modified approach, Clust-CONSRANK, embedding a
contact-based clustering step prior to the CONSRANK algorithm. Clust-CONSRANK was
developed in an attempt to overcome one of the intrinsic limitations of a pure consensus
approach, i.e. the propensity to select similar solutions. Testing on past CAPRI targets has shown
the clustering step to substantially enhance the chance of identifying at least one correct solution
for the most challenging targets.4
A combined CONSRANK/Clust-CONSRANK approach has therefore been applied to
the scoring of the CASP12-CAPRI37 targets.

Methods
An initial classification of the targets as easy or difficult was made, based on the availability of
possible templates for the modeling.
Based on the above classification, Clust-CONSRANK was used to rank models for
targets T113-T116, while CONSRANK was used to score models for the remaining targets.
Manual intervention was only used on target T120, to have the dual binding mode of this
complex represented, and, on all the targets, for excluding models with a high number of steric
clashes.

Results
Preliminary assessment results for CASP12-CAPRI37 show that, with our algorithms, we could
identify native-like solutions for 8 assessed interfaces (corresponding to 6 targets), on par with
the best performing scorers in this experiment. Moreover, we achieved the highest number of
correct solutions overall (72 vs. 62 of the second scorer in such terms).
However, we submitted medium or high quality solutions for 4 interfaces, while other
scorers succeeded in identifying medium/high quality solutions for up to 6 interfaces. Therefore,
while the assessment of our consensus approach in CASP12-CAPRI37 confirms that it can
perform comparably to the state-of-the-art energy- and knowledge-based scoring
algorithms, it also points out a possible limitation of the method in singling out correct solutions
of high quality. Our future efforts will be devoted to improving our scoring algorithms in this
respect, i.e. in their ability to identify high quality models.
We remark that consensus approaches are still underrepresented in the field of docking
decoy scoring. Therefore, the consistently good performance of CONSRANK (now combined
with Clust-CONSRANK) in all the recent CAPRI Rounds calls for further investigation to best
exploit the potential of this kind of approach.

Availability
CONSRANK is available as a web server2 at: https://www.molnac.unisa.it/BioTools/consrank/.

1. Oliva, R., Vangone, A. and Cavallo, L. (2013) Ranking multiple docking solutions based on
the conservation of inter-residue contacts. Proteins 81, 1571-1584.
2. Chermak, E., Petta, A., Serra, L., Vangone, A., Scarano, V., Cavallo, L., Oliva, R. (2015)
CONSRANK: a server for the analysis, comparison and ranking of docking models based on
inter-residue contacts, Bioinformatics 31, 1481-3.
3. Lensink M.F., Velankar S., et al. (2016) Prediction of homo- and hetero-protein complexes by
protein docking and template-based modeling: a CASP-CAPRI experiment, Proteins 84,
Suppl 1:323-48.
4. Chermak, E., De Donato, R., Lensink M.F., Petta, A., Serra, L., Scarano, V., Cavallo L.,
Oliva R. Introducing a Clustering Step in a Consensus Approach for the Scoring of Protein-
Protein Docking Models, under review.

CAPRI: Lensink

Modeling protein-protein assemblies: The CASP12-CAPRI challenge

M.F. Lensink1, S. Velankar2 and S.J. Wodak3


1 – University of Lille, CNRS UMR8576 UGSF, F-59000 Lille, France, 2 – European Molecular Biology
Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge
CB10 1SD, United Kingdom, 3 – VIB Structural Biology Research Center, VUB, B-1050 Brussels, Belgium
marc.lensink@univ-lille1.fr

We present the results for CAPRI Round 37, the second joint CASP-CAPRI protein assembly
prediction challenge. The Round comprised a total of 10 targets from among those submitted to
the CASP12 prediction experiment held in the summer of 2016, including 7 homo-oligomers (3
dimers, 3 trimers and 1 octamer) and 3 hetero-oligomers (2 dimers and 1 tetramer). For five of
the homo-oligomeric targets suitable templates were available in the Protein Data Bank; for the
remaining five targets the task was more challenging. For all targets oligomeric state information
was made available to predictors; for the majority of targets this information was derived from
experiment.
Overall 65 different predictor, server and scorer groups submitted in total more than
14000 models, which were assessed against the 3D structures of the corresponding target
complexes. Thirty nine of the 48 groups participating in the Docking experiment and 16 of the 17
groups participating in the Scoring experiment submitted models for at least 7 of the 10 targets.
Unlike in a standard CAPRI Round, only the five top ranking models of the submission were
evaluated, to enable fair comparison between CAPRI and CASP participants, where a
submission of five models is the standard.
The evaluation employed the standard CAPRI evaluation criteria1 to quantify the extent
to which submitted models reproduced the relevant pairwise protein-protein interfaces of each
target (a total of 15 interfaces for the 10 targets).
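
As a rough illustration of how such criteria map interface measures to quality classes, the sketch below uses the threshold values commonly quoted in CAPRI assessment papers for fnat (fraction of native contacts), L-RMSD (ligand RMSD) and i-RMSD (interface RMSD). The function name is hypothetical and the exact cut-offs are an assumption that should be checked against ref. 1.

```python
def capri_class(fnat, l_rmsd, i_rmsd):
    """Classify a model from fnat and the two RMSD measures (in Å).

    Thresholds are the commonly quoted CAPRI values (assumed here,
    not taken from this abstract). Classes are tested from best to worst.
    """
    if fnat >= 0.5 and (l_rmsd <= 1.0 or i_rmsd <= 1.0):
        return "high"
    if fnat >= 0.3 and (l_rmsd <= 5.0 or i_rmsd <= 2.0):
        return "medium"
    if fnat >= 0.1 and (l_rmsd <= 10.0 or i_rmsd <= 4.0):
        return "acceptable"
    return "incorrect"

# Made-up example values.
examples = [
    capri_class(0.60, 0.8, 0.9),
    capri_class(0.40, 4.0, 3.0),
    capri_class(0.15, 12.0, 3.5),
    capri_class(0.57, 25.0, 9.6),
]
```

The last example shows that a model can recover more than half of the native contacts and still be classified as incorrect when both RMSD measures are large.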

Results
A good number of predictor and scorer groups were able to produce high-quality models for 4 of
the 7 easy interfaces, including the three different interfaces of the octamer. Two automatic
servers were among the best performing participants in this category of targets: the ROSETTA
server by David Baker ranked top and, more interestingly still, the “naive assembly” server by
Chaok Seok, which used off-the-shelf tools for template identification and homology modeling,2
ranked highly too, an indication that modeling this category of easy targets is well within reach
of current modeling tools. This was however not the case for the difficult targets, where either
only distantly related templates or none were available. For these, many fewer groups, primarily
CAPRI participants and remarkably one server, the veteran docking server CLUSPRO, submitted
acceptable models.
Details of these results, the issues they raise and the lessons that can be learned, will be
discussed.

1. Lensink, M.F. & Wodak, S.J. (2010). Docking and scoring protein interactions: CAPRI
2009. Proteins 78, 3073-3084.
2. Lensink, M.F., Velankar, S. & Wodak, S.J. (2016) Prediction of homo- and hetero-protein
complexes by protein docking and template-based modeling: a CASP-CAPRI
experiment. Proteins 84 S1, 323-348.

CAPRI: Takeda-Shitaka_Lab

Prediction of Oligomeric Protein Structures based on Template-Based Modeling
in CAPRI round 37

Yasuomi Kiyota, Yudai Yamamoto, Katsuya Naito and Mayuko Takeda-Shitaka


School of Pharmacy, Kitasato University, Tokyo, Japan
shitakam@pharm.kitasato-u.ac.jp

We participated in CAPRI round 37 as a human group. We predicted both homo- and
hetero-oligomeric protein structures according to the oligomeric state in the CASP12 target list.
Our modeling procedure was based on a template-based modeling method.

Methods
Template search
We mainly used HHblits1,2 for template search. In order to find appropriate templates for as
many oligomer targets as possible, we constructed an original template database for oligomer
prediction. In CAPRI round 37, HHblits was run on this database. Symmetry operations were
carried out to generate oligomeric template structures when necessary. For template selection and
alignment, we also used the information of proteins obtained by PSI-BLAST3 and RPS-
BLAST4. In some cases, alignments were refined with manual intervention.

3D-structure construction
We constructed 3D homo- and hetero- oligomeric structures according to the given oligomeric
state in the target list. For all oligomer targets, we constructed oligomeric structures based on the
oligomeric template structures using MODELLER5.
Moreover, for hard targets, we used CASP12 server models (monomer models). The server
monomer models were superimposed onto the oligomer templates using the interfacial residues
of the template. Refinement was carried out to remove steric clashes between the interface residues.
Models were ranked based on the structural similarity between monomer models and templates
in the interfacial regions.

1. Remmert,M., Biegert,A., Hauser,A., Söding,J. (2011) HHblits: Lightning-fast iterative
protein sequence searching by HMM-HMM alignment. Nat Methods. 9(2), 173-175.
2. Söding, J., Biegert, A. & Lupas, A. N. (2005) The HHpred interactive server for protein
homology detection and structure prediction. Nucleic Acids Res. 33, W244-248.
3. Altschul,S.F., Madden,T.L., Schaffer,A.A., Zhang,J., Zhang,Z., Miller,W. & Lipman,D.J.
(1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search
programs. Nucleic Acids Res. 25, 3389-3402.
4. Marchler-Bauer A, Panchenko AR, Shoemaker BA, Thiessen PA, Geer LY, Bryant SH.
(2002) CDD: a database of conserved domain alignments with links to domain three-
dimensional structure. Nucleic Acids Res. 30, 281–283.
5. A. Sali and T.L. Blundell. (1993). Comparative protein modelling by satisfaction of spatial
restraints. J. Mol. Biol. 234, 779-815.

CAPRI: Vakser

Modeling CAPRI Targets 110 – 120 by Template-Based and Free Docking

Petras J. Kundrotas, Ivan Anishchenko, Madhurima Das, Taras Dauzhenka, and Ilya A. Vakser
Center for Computational Biology and Department of Molecular Biosciences, University of Kansas, Lawrence, KS
vakser@ku.edu, pkundro@ku.edu

The number of experimentally determined protein structures accounts for only a fraction of
the known proteome. Thus, protein docking has to rely primarily on modeled structures of the
individual proteins.1 The joint CASP12-CAPRI round provided a unique opportunity to test our
ability to do that, utilizing CASP-generated models and the CAPRI assessment framework.

Methods
For each target, models of individual proteins were selected from 150 CASP Stage 2 models (for
T119 and T120 the selection was made from all available CASP models). The choice was based
on similarity (in terms of our atom-atom contact potential AACE18, and TM-score2) of a CASP
model to the template(s) with the highest score identified by HHSearch program.3,4 For the
template-based docking we used a modification of our previously developed protocol.5,6 The
procedure performs spatial rearrangement of 3D structures of the two target proteins to match the
monomers of the co-crystallized complexes either from the full-structure template library7 or
from a smaller library generated for a particular target utilizing the HHSearch templates.
Structural alignment of the proteins was performed by TM-align.8 Scoring of the docking
predictions was performed by a combination of structure similarity scores, normalized AACE18
values for the interface, fraction of shared target/template contacts, target/template interface
sequence identity, interface solvation score, and the extent of clashes in the unrefined
predictions. The free docking was performed by GRAMM9,10 at lower resolution in order to
accommodate the structural inaccuracies of the modeled monomers. The predicted matches were
scored by the AACE18 potential and minimized by TINKER.11

Results
For each of the eight n-homomeric targets (n = 2 for T114, T115, T116, T119, n = 3 for T110,
T111, T112 and n = 8 for T118), we generated a custom library of templates from PDB identified
by HHSearch as the most plausible templates for the monomers, having the required oligomeric
state in the biological unit. The template-based docking was performed using one or several
CASP models, using these libraries. The docking predictions were evaluated by the combined
scoring function (see Methods). This resulted in 4 high, 10 medium and 11 acceptable-quality
predictions for the trimeric targets; 4 high and 6 medium-quality predictions for the octameric
target; and 2 medium and 5 acceptable quality predictions for the dimeric T119. For the dimeric
targets T114, T115 and T116, there were not enough reliable template-based predictions based on
the custom libraries (combined score < 0.2). Thus we performed additional free docking, scored
by the AACE18 potential. The top free docking matches were included in the final pool of
predictions (targets T114, T115). The best model for T114 had fnat (fraction of predicted native
contacts) 0.27, but poor fnonnat (fraction of non-native contacts) 0.66, and i-RMSD (interface Cα
RMSD of the ligand) 15.6 Å. For this target, there was no consistency in the oligomeric states of
the identified templates (results for T115 were not available at the time). T116 had templates
with two distinct interface structures with various ligands at the interface. Thus, we put an equal
number of models from both template types into the final prediction pool. The best model for this
target had fnat = 0.40, but fnonnat = 0.82, and i-RMSD 14.5 Å, most likely due to the structure
distortion caused by the interface ligands, not captured by the CASP models.
For the hetero-dimeric target T120, HHSearch identified 10 PDB structures that are common
to both interacting proteins. Since the majority of the CASP models for these proteins were
similar, all CASP models were docked by the template-based protocol using these 10 templates,
resulting in 3 medium and 5 acceptable quality predictions. For the heterodimeric T113, no
reliable template-based predictions were generated. Thus, all submitted predictions were from
the free docking. The best prediction had fnat = 0.17, but fnonnat = 0.86 and i-RMSD = 7.3 Å, likely
due to poor quality of the CASP models of the smaller protein. For the most difficult T117
(dimer of hetero-dimers), docking was performed by combining truncated PDB structures for the
hetero-dimer with free docking of the heterodimers, and subsequent addition of the missing
residues. Although the submitted predictions were classified as incorrect, the best prediction had
fnat = 0.57, fnonnat = 0.75 and i-RMSD = 9.6 Å.
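
The two contact-based measures quoted above can be computed from sets of interface contacts, as in the sketch below. The function name and the representation of a contact as a pair of residue numbers are assumptions made for illustration. The definitions follow the usual reading of fnat (native contacts reproduced by the model, divided by all native contacts) and fnonnat (model contacts absent from the native interface, divided by all model contacts).

```python
def contact_fractions(native, model):
    """native, model: sets of (residue_i, residue_j) interface contacts.
    Returns (fnat, fnonnat). The native contact set must be non-empty."""
    shared = native & model
    fnat = len(shared) / len(native)
    fnonnat = len(model - native) / len(model) if model else 0.0
    return fnat, fnonnat

# Toy interface: 4 native contacts; the model reproduces 2 of them
# and adds 6 contacts that do not exist in the native complex.
native = {(1, 101), (2, 102), (3, 103), (4, 104)}
model = {(1, 101), (2, 102), (7, 110), (8, 111),
         (9, 112), (10, 113), (11, 114), (12, 115)}
fnat, fnonnat = contact_fractions(native, model)
```

In this toy case half of the native contacts are recovered, yet three quarters of the predicted contacts are wrong, the same pattern reported for the best T117 prediction.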

Availability
The docking procedures and libraries used in this round are partially available from our
DOCKGROUND resource for protein recognition studies at http://dockground.compbio.ku.edu.

[1] Vakser,I.A. Low-resolution structural modeling of protein interactome. Curr. Opin. Struct.
Biol. 2013;23:198–205.
[2] Zhang,Y., Skolnick,J. Scoring function for automated assessment of protein structure
template quality. Proteins. 2004;57:702-10.
[3] Remmert,M., Biegert,A., Hauser,A., Soding,J. HHblits: Lightning-fast iterative protein
sequence searching by HMM-HMM alignment. Nat. Methods. 2012;9:173-5.
[4] Soding,J. Protein homology detection by HMM–HMM comparison. Bioinformatics.
2005;21:951-60.
[5] Sinha,R., Kundrotas,P.J., Vakser,I.A. Docking by structural similarity at protein-protein
interfaces. Proteins. 2010;78:3235-41.
[6] Kundrotas,P.J., Vakser, I.A. Global and local structural similarity in protein-protein
complexes: Implications for template-based docking. Proteins. 2013;81:2137–42.
[7] Anishchenko,I., Kundrotas,P.J., Tuzikov,A.V., Vakser,I.A. Structural templates for
comparative protein docking. Proteins. 2015;83:1563–70.
[8] Zhang,Y., Skolnick,J. TM-align: A protein structure alignment algorithm based on the TM-
score. Nucl. Acid Res. 2005;33:2302-9.
[9] Katchalski-Katzir,E., Shariv,I., Eisenstein,M., Friesem,A.A., Aflalo,C., Vakser,I.A.
Molecular surface recognition: Determination of geometric fit between proteins and their
ligands by correlation techniques. Proc. Natl. Acad. Sci. USA. 1992;89:2195-9.
[10] Vakser,I.A. Protein docking for low-resolution structures. Protein Eng. 1995;8:371-7.
[11] Shi,Y., Xia,Z., Zhang,J., Best,R., Wu,C., Ponder,J.W., et al. The polarizable atomic
multipole-based AMOEBA force field for proteins. J. Chem. Theory Comput. 2013;9:4046-
63.

CAPRI: Vakser

Semi-analytical contact potential for proteins and protein complexes

Ivan Anishchenko, Petras J. Kundrotas, and Ilya A. Vakser


Center for Computational Biology and Department of Molecular Biosciences, University of Kansas, Lawrence, KS
vakser@ku.edu, pkundro@ku.edu, anishchenko.ivan@gmail.com

We developed a semi-analytical contact potential, based on the q-state Potts model, for docking and
protein structure prediction. Benchmarking showed that this simple contact potential outperforms
most top existing potentials, and captures the structural details explicitly encoded by more
sophisticated potentials.

Methods
For a protein consisting of N atoms/residues, each belonging to one of q types
$x_i \in \{1, 2, \ldots, q\}$, the energy U can be approximated by the generalized q-state Potts
model [1]

$$U = \sum_{i} h(x_i) + \sum_{i \sim j} J(x_i, x_j),$$

where the first and the second terms account for one- and two-body interactions respectively. The
two-body sum runs over all contacting pairs of atoms/residues ($i \sim j$) that are closer in space
than a cut-off distance $d_{max}$ and are separated by at least $k_{min}$ residues. Elements of the
q-dimensional vector of local fields $h(x_i)$ and $q \times q$ matrix of the coupling constants
$J(x_i, x_j)$ were determined by maximizing the pseudo-likelihood function [2] constructed for a
non-redundant training set of 6,338 individual protein chains selected by PISCES server [3]. The
optimal values of parameters $d_{max}$ and $k_{min}$ were determined in benchmarking. Docking
was performed by GRAMM [4,5] at low resolution with 3.5 Å grid step and 10° angular intervals.

Figure 1. Performance of different potentials in best model structure recognition for CASP decoys.
Average best model Z-score and average Pearson's correlation coefficient of the energy score with
GDT_TS score are shown for different scoring functions.
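The contact energy described in this section can be illustrated with a short sketch. It assumes one site per residue, a plus sign in front of both sums, and toy values for the local fields and couplings; the function names `contact_pairs` and `potts_energy` are hypothetical and are not taken from the authors' software.

```python
import math

def contact_pairs(coords, d_max, k_min):
    """Return pairs (i, j), i < j, closer than d_max and at least k_min sites apart."""
    pairs = []
    for i in range(len(coords)):
        for j in range(i + k_min, len(coords)):
            if math.dist(coords[i], coords[j]) < d_max:
                pairs.append((i, j))
    return pairs

def potts_energy(types, coords, h, J, d_max, k_min):
    """Sum of one-body fields h[x_i] and two-body couplings J[x_i][x_j] over contacts."""
    one_body = sum(h[t] for t in types)
    two_body = sum(J[types[i]][types[j]]
                   for i, j in contact_pairs(coords, d_max, k_min))
    return one_body + two_body

# Toy example with q = 2 types (all numbers are illustrative, not fitted parameters).
types = [0, 1, 0, 1]
coords = [(0, 0, 0), (4, 0, 0), (8, 0, 0), (4, 3, 0)]
h = [0.1, -0.2]
J = [[0.0, -1.0], [-1.0, 0.5]]
print(potts_energy(types, coords, h, J, d_max=6.0, k_min=2))
```

In this toy case only the pairs (0, 3) and (1, 3) satisfy both the distance and the sequence-separation criteria, so the two-body term reduces to J[0][1] + J[1][1].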

Results
We tested residue-residue potential with q = 20 (RRCE20) and atom-atom potentials with q = 18
(AACE18), 20 (AACE20) and 167 (AACE167) on CASP10 and CASP11 decoys [6,7] and on a set
of free docking decoys for the unbound structures of 396 protein complexes (DOCKGROUND
benchmark 4.0). Comparison with other statistical potentials showed that our potentials are
among the top performing ones, with only sophisticated, orientation-dependent GOAP potential [8]
outperforming our AACE18 and AACE167 potentials (Fig. 1). The results show that these simple
contact potentials capture most structural details that much more complex functions encode
explicitly. For protein docking, all our potentials outperformed MJ3h potential [9] (Fig. 2),
which in our earlier tests showed best performance in protein docking among potentials
developed for individual proteins. Interestingly, the use of 18 atom types assigned according to
physicochemical properties [10] yielded potential with the performance comparable to the much
more complex AACE167 potential (210 distinct couplings in AACE18 vs. 14,028 in AACE167).

Figure 2. Performance of the potentials in protein docking. Dashed and solid lines show success
rates for free docking by shape complementarity only, and scored by the MJ3h potential,
respectively.
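As a quick check on the parameter counts, the number of independent couplings can be computed under the assumption that the coupling matrix J is symmetric and that its diagonal is included. The helper name `distinct_couplings` is hypothetical.

```python
def distinct_couplings(q):
    """Independent entries of a symmetric q x q matrix, diagonal included."""
    return q * (q + 1) // 2

for q in (18, 20, 167):
    print(q, distinct_couplings(q))
# prints:
# 18 171
# 20 210
# 167 14028
```

Under this counting, 14,028 corresponds to q = 167 and 210 corresponds to q = 20, while 18 types would give 171, so the quoted figure of 210 may follow a different counting convention or refer to the 20-type variant.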

Availability
The docking procedures and datasets used in this study are partially available from our
DOCKGROUND resource for protein recognition studies at http://dockground.compbio.ku.edu.

[1] Wu,F.Y. The Potts model. Rev. Mod. Phys. 1982;54:235.
[2] Gidas,B. Consistency of maximum likelihood and pseudo-likelihood estimators for Gibbs
distributions. In: Fleming,W., Lions,P.L., editors. Stochastic Differential Systems, Stochastic
Control Theory and Applications. New York: Springer; 1988. p. 129-45.
[3] Wang,G., Dunbrack,R.L. PISCES: Recent improvements to a PDB sequence culling server.
Nucl. Acids Res. 2005;33:W94-W8.
[4] Katchalski-Katzir,E., Shariv,I., Eisenstein,M., Friesem,A.A., Aflalo,C., Vakser,I.A.
Molecular surface recognition: Determination of geometric fit between proteins and their
ligands by correlation techniques. Proc. Natl. Acad. Sci. USA. 1992;89:2195-9.
[5] Vakser,I.A. Protein docking for low-resolution structures. Protein Eng. 1995;8:371-7.
[6] Moult,J., Fidelis,K., Kryshtafovych,A., Schwede,T., Tramontano,A. Critical assessment of
methods of protein structure prediction (CASP) — round X. Proteins. 2014;82(Suppl 2):1-6.
[7] Moult,J., Fidelis,K., Kryshtafovych,A., Schwede,T., Tramontano,A. Critical assessment of
methods of protein structure prediction: Progress and new directions in round XI. Proteins.
2016;84 Suppl.1:4-14.
[8] Zhou,H., Skolnick,J. GOAP: A generalized orientation-dependent, all-atom statistical
potential for protein structure prediction. Biophys. J. 2011;101:2043-52.
[9] Miyazawa,S., Jernigan,R.L. Self-consistent estimation of inter-residue protein contact
energies based on an equilibrium mixture approximation of residues. Proteins. 1999;34:49-
68.
[10] Zhang,C., Vasmatzis,G., Cornette,J.L., DeLisi,C. Determination of atomic desolvation
energies from the structures of crystallized proteins. J. Mol. Biol. 1997;267:707-26.

CAPRI: Venclovas

Comparative Modeling of Protein Complexes in CAPRI Round 37

J. Dapkūnas1, K. Olechnovič1,2 and Č. Venclovas1


1 - Institute of Biotechnology, Vilnius University, 2 - Faculty of Mathematics and Informatics, Vilnius University
justas.dapkunas@bti.vu.lt, venclovas@ibt.lt

We participated in CAPRI Round 37 aiming to test the utility of our PPI3D web server in finding
templates for comparative modeling of structures of protein complexes and the ability to use our
model accuracy estimation method VoroMQA for scoring and ranking structural models of
protein-protein interactions.

Methods
We used comparative protein structure modeling to predict the structures of the CAPRI Round
37 targets. The initial search for templates was performed by the PPI3D web server1. Given the
protein sequences, PPI3D enables searches for binary protein-protein interaction data of
homologous proteins in the Protein Data Bank (PDB). PPI3D sequence searches are performed
using either BLAST or PSI-BLAST methods. The initial structural models of protein dimers
were generated using PPI3D from the results of the target sequence searches. If necessary, the
structures were subsequently remodeled to represent higher oligomeric states.

The PPI3D web server is designed for the search and analysis of structural data on
protein interactions. It provides the homology models only as an additional feature, and therefore
no model refinement and optimization techniques are implemented. To improve the initial
structural models we applied additional steps including sequence alignment and structure
refinement. Thus, after the initial search and modeling by PPI3D, additional sequence-structure
alignments were generated for the same templates using the HHpred server available from the
MPI Bioinformatics Toolkit2. All the alignments were utilized to create structural models for the
target protein complex using MODELLER3. A previous CASP-CAPRI experiment showed that
higher quality models of individual subunits usually lead to higher quality docking models of
protein complexes4. In order to test this feature for homology modeling of interaction interfaces,
for each target we selected one or two models having the best VoroMQA scores from the CASP
server models. The protein complex structure was then modeled using these CASP server models
as templates for the individual subunits and the templates from PPI3D and HHpred results for the
protein-protein interface. Models obtained using reliable structural templates were further refined
by applying fragment-guided molecular dynamics implemented in the FG-MD server5.

The constructed models for protein complexes were ranked using our recently developed
method VoroMQA ("Voronoi tessellation-based Model Quality Assessment")6. It combines the
idea of knowledge-based statistical potentials with the advanced use of the Voronoi tessellation
of atomic balls. VoroMQA uses contact areas instead of distances for describing and seamlessly
integrating both the explicit interactions between protein atoms and the implicit interactions of
protein atoms with solvent. It produces scores at atomic, residue and global levels, all in the
fixed range from 0 to 1. As a result, VoroMQA can also derive an interface quality score from the
local scores of the atoms that participate in inter-chain contacts. We used both whole-structure
and interface-only VoroMQA scores to rank our models prior to submission.

The same ranking procedure was also employed for selecting the best models from the
sets provided for the CAPRI scorers. However, these sets often contained models with a large
number of steric clashes. Such models were filtered out before applying VoroMQA.
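The filter-then-rank step for scorer sets can be sketched as follows. This is only an illustration: the clash definition (inter-chain atom pairs closer than 3.0 Å), the clash threshold, the data layout and the assumption that a higher quality score is better are all hypothetical, and the score values stand in for VoroMQA output rather than being computed by it.

```python
import math

def count_clashes(chain_a, chain_b, min_dist=3.0):
    """Count inter-chain atom pairs closer than min_dist (hypothetical clash criterion)."""
    return sum(1 for p in chain_a for q in chain_b if math.dist(p, q) < min_dist)

def filter_and_rank(models, max_clashes=5):
    """Drop models with too many clashes, then sort by score (higher assumed better)."""
    kept = [m for m in models
            if count_clashes(m["chain_a"], m["chain_b"]) <= max_clashes]
    return sorted(kept, key=lambda m: m["score"], reverse=True)

# Toy models: one atom per chain, scores are placeholders in the 0..1 range.
models = [
    {"name": "a", "chain_a": [(0, 0, 0)], "chain_b": [(1, 0, 0)], "score": 0.9},
    {"name": "b", "chain_a": [(0, 0, 0)], "chain_b": [(5, 0, 0)], "score": 0.6},
    {"name": "c", "chain_a": [(0, 0, 0)], "chain_b": [(4, 0, 0)], "score": 0.7},
]
print([m["name"] for m in filter_and_rank(models, max_clashes=0)])
```

Model "a" has the best score but is removed first because of its clash, which mirrors the order of operations described above: filtering precedes scoring.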

Results
The templates could be straightforwardly identified for 7 of 11 targets of CAPRI Round 37
(T110 (CASP T0860), T111 (T0867), T112 (T0881), T116 (T0893), T118 (T0906), T119
(T0917), and T120 (T0921-T0922)). Models for these targets were generated as described above.
Two targets had only low reliability templates (T114 (T0875) and T115 (T0877)). For example,
only monomeric templates could be identified for T115; the dimeric structure was therefore
modeled using the largest crystal interfaces of these templates, downloaded from the PDBePISA server7. No
templates could be found for the hetero-interaction interfaces of T113 (T0884-T0885) and T117
(T0903-T0904). Models for these targets were constructed using CASP server models.

Availability
The PPI3D web server is available at http://bioinformatics.ibt.lt/ppi3d/. The VoroMQA web
application is available at http://bioinformatics.ibt.lt/wtsam/voromqa. VoroMQA software for
Linux is included in the Voronota package freely available from
http://bitbucket.org/kliment/voronota/downloads.

1. Dapkūnas,J. et al. (2016). The PPI3D web server for searching, analyzing and modeling
protein-protein interactions in the context of 3D structures. Submitted.
2. Alva,V., Nam,S.-Z., Söding,J. & Lupas,A.N. (2016). The MPI bioinformatics Toolkit as an
integrative platform for advanced protein sequence and structure analysis. Nucleic Acids Res.
44, W410-415.
3. Šali,A. & Blundell,T.L. (1993). Comparative Protein Modelling by Satisfaction of Spatial
Restraints. J. Mol. Biol. 234, 779–815.
4. Lensink,M.F. et al. (2016). Prediction of homoprotein and heteroprotein complexes by
protein docking and template-based modeling: A CASP-CAPRI experiment. Proteins 84
Suppl 1, 323–348.
5. Zhang,J., Liang,Y. & Zhang,Y. (2011). Atomic-level protein structure refinement using
fragment-guided molecular dynamics conformation sampling. Structure 19, 1784–1795.
6. Olechnovič,K. & Venclovas,Č. (2016). VoroMQA: quality assessment of single protein
structure models using inter-atom contact areas derived from the Voronoi tessellation of
atomic balls. Manuscript in preparation.
7. Krissinel,E. & Henrick,K. (2007). Inference of macromolecular assemblies from crystalline
state. J. Mol. Biol. 372, 774–797.

CAPRI: ZOU_GROUP

The Quality of Monomeric Structure is Important to the Docking-based Protein Complex Structure Prediction

Xianjin Xu1†, Liming Qiu1†, Rui Duan1†, Jie Hou2, Tyler W. Ball3, Jianlin Cheng2,4,
Xiaoqin Zou1,3,4,5*
1 - Dalton Cardiovascular Research Center, 2 - Department of Computer Science, 3 - Department of Biochemistry, 4
- Informatics Institute, 5 - Department of Physics and Astronomy, University of Missouri, Columbia, MO 65211,
USA, † XX, LQ, and RD contributed equally to this work.
* zoux@missouri.edu

Protein-protein docking is a useful tool to predict protein complex structures formed by two or
more monomeric proteins, based on the individual protein structures1-3. However, it is
commonplace in practical studies that structural information is incomplete or totally missing.
Hence, docking with monomeric structures built from sequence data is becoming increasingly
important, and the success in modeling the monomeric structures is pivotal to the success of
docking. In view of this practical issue, two independent methods, ITScorePro4 and
MULTICOM5-6, were used to enhance the quality of the predicted monomeric structures.
ITScorePro is a statistical potential-based scoring function that was developed in the Zou Lab to
overcome the reference state challenge for ranking the predicted monomeric structures according
to the calculated free energy scores. It has been systematically tested for single protein structure
prediction. MULTICOM is a robust and consistent quality assessment (QA) method developed in
the Cheng Lab that integrated multiple complementary QA methods and achieved excellent
performance in the CASP11 experiment. Up to 10 monomeric structures with consensus high
rankings were selected as the input to an in-house docking program MDockPP7-8 to predict
protein complex structures. This approach was applied to the recent CASP12-CAPRI37 joint
experiment.

Methods
Starting from a given amino acid sequence, BLAST9 was used to search for homologous proteins
with the 3D structures deposited in the Protein Data Bank, in order to generate monomeric
structures through homology modeling. If homologous proteins were found, MODELLER10 was
employed to construct the corresponding monomeric structures. In the cases where a complex
homologous to the target complex was recovered, MODELLER was also used to directly build
the target complex structure. The modeled complex structure was regarded as one of the putative
binding modes as those produced by docking. On the other hand, if no homologous proteins were
identified, a template-free modeling approach implemented on the CASP11 MULTICOM-CLUSTER
server11 was used to generate model structures for the monomers. In addition, the
CASP groups provided a total of 150 best monomeric structures for each target as the references
for the CAPRI groups. For quality control and selection, these candidate monomeric structures
were evaluated by two independent methods, ITScorePro4 and MULTICOM5-6. ITScorePro is a
scoring function that the Zou Lab developed for protein structure selection using an iterative,
statistical mechanics-based approach to circumvent the reference-state problem in statistical
potentials. The other method, MULTICOM from the Cheng Lab, integrates 14 quality
assessment (QA) methods (including both single and consensus QA methods) to generate model
quality scores. Finally, up to 10 model structures with consensus high rankings were selected for
docking.
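The abstract does not specify how the consensus between the two rankings was formed. One simple possibility, shown here purely as an assumption, is to sum the ranks from the two methods and keep the models with the lowest rank sum. The sketch also assumes that a lower ITScorePro energy and a higher MULTICOM quality score are better; the function names and the toy scores are hypothetical.

```python
def ranks(scores, lower_is_better):
    """Map each model name to its rank (1 = best) under one scoring method."""
    ordered = sorted(scores, key=scores.get, reverse=not lower_is_better)
    return {name: r for r, name in enumerate(ordered, start=1)}

def consensus_top(energy, quality, n=10):
    """Select up to n models with the smallest rank sum; ties broken by name."""
    r_e = ranks(energy, lower_is_better=True)    # energy-like score (assumed)
    r_q = ranks(quality, lower_is_better=False)  # quality-like score (assumed)
    common = set(r_e) & set(r_q)
    return sorted(common, key=lambda m: (r_e[m] + r_q[m], m))[:n]

# Toy scores for three candidate monomer models.
energy = {"m1": -120.0, "m2": -150.0, "m3": -90.0}
quality = {"m1": 0.70, "m2": 0.65, "m3": 0.40}
print(consensus_top(energy, quality, n=2))
```

A rank sum avoids having to put the two scores on a common scale, which is why it is used in this sketch instead of averaging raw values.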
A hierarchical, automated protocol, MDockPP, was used to predict protein complex
structures based on these selected monomer structures. Briefly, the FFT-based rigid docking
algorithm, ZDOCK 3.012, was used to generate putative binding modes. For the targets with
symmetry information, M-ZDOCK13 was employed to generate putative binding modes. The
interval of the Euler angles was set to 6° and 3000 putative binding modes were generated for
each docking. Then, the created binding modes were optimized and scored by an atomic-level,
statistical potential-based scoring function for protein-protein interactions, ITScorePP14, which
was developed by our group using an efficient iterative method based on the crystal structures of
heterodimeric protein complexes. Where available, biological information was used to filter out
incompatible modes in this step. Afterwards, the putative binding modes were ranked and then
clustered based on the backbone root-mean-square deviation (b-RMSD) of the complexes. For
any two binding modes with a b-RMSD less than a cutoff value (5 Å), only the one with the
better score was kept. Ten models were selected for submission.
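The ranking and clustering step can be sketched as a greedy procedure: binding modes are visited from best to worst score, and a mode is kept only if its b-RMSD to every mode kept so far is at least the cutoff. The sketch assumes that a lower score is better, that all modes share the same coordinate frame (so no superposition is needed), and that backbone atoms are listed in the same order; the function names are hypothetical.

```python
import math

def rmsd(a, b):
    """RMSD between two equally ordered coordinate lists in the same frame."""
    return math.sqrt(sum(math.dist(p, q) ** 2 for p, q in zip(a, b)) / len(a))

def cluster_modes(modes, cutoff=5.0, n_keep=10):
    """Greedy clustering; modes is a list of (score, backbone_coords) tuples."""
    kept = []
    for score, coords in sorted(modes, key=lambda m: m[0]):  # best score first
        if all(rmsd(coords, c) >= cutoff for _, c in kept):
            kept.append((score, coords))
        if len(kept) == n_keep:
            break
    return kept

# Toy binding modes with two backbone atoms each.
modes = [
    (-10.0, [(0, 0, 0), (1, 0, 0)]),
    (-12.0, [(0, 0, 1), (1, 0, 1)]),   # within 5 A of the first mode, better score
    (-8.0, [(10, 0, 0), (11, 0, 0)]),  # far from both
]
print([score for score, _ in cluster_modes(modes)])
```

Because modes are processed in score order, whenever two modes fall within the cutoff of each other the one with the better score is the one that survives.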

1. Wodak,S.J., Janin,J. (1978) Computer analysis of protein–protein interaction. J. Mol. Biol.
124,323-342.
2. Smith,G.R., Sternberg,M.J. (2002) Prediction of protein–protein interactions by docking
methods. Curr. Opin. Struct. Biol. 12,28-35.
3. Halperin,I., Ma,B., Wolfson,H., Nussinov,R. (2002) Principles of docking: an overview of
search algorithms and a guide to scoring functions. Proteins 47,409-443.
4. Huang,S.Y., Zou,X. (2014) ITScorePro: An efficient scoring program for evaluating the
energy scores of protein structures for structure prediction. Methods Mol. Biol. 1137,71-81.
5. Cao,R., Bhattacharya,D., Adhikari,B., Li,J., Cheng,J. (2015) Massive integration of diverse
protein quality assessment methods to improve template based modeling in CASP11.
Proteins (in press).
6. Cao,R., Bhattacharya,D., Adhikari,B., Li,J. & Cheng,J. (2015) Large-scale model quality
assessment for improving protein tertiary structure prediction. Bioinformatics 31,i116-i123.
7. Huang,S.Y., Zou,X. (2010) MDockPP: A hierarchical approach for protein-protein docking
and its application to CAPRI rounds 15-19. Proteins 78,3096-3103.
8. Huang,S.Y., Yan,C., Grinter,S.Z., Chang,S., Jiang,L., Zou,X. (2013) Inclusion of the
orientational entropic effect and low‐resolution experimental information for protein–protein
docking in Critical Assessment of PRedicted Interactions (CAPRI). Proteins 81,2183-2191.
9. Altschul,S.F., Madden,T.L., Schäffer,A.A., Zhang,J., Zhang,Z., Miller,W. & Lipman, D.J.
(1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search
programs. Nucleic Acids Res. 25, 3389-3402.
10. Marti-Renom,M.A., Stuart,A., Fiser,A., Sanchez,R., Melo,F., Sali,A. (2000) Comparative
protein structure modeling of genes and genomes. Annu. Rev. Biophys. Biomol. Struct. 29,
291-325.
11. Li, J., Deng, X., Eickholt, J. & Cheng, J. (2013) Designing and Benchmarking the
MULTICOM Protein Structure Prediction System. BMC Struct. Biol. 13,1.
12. Chen,R., Li,L. & Weng,Z. (2003) ZDOCK: an initial-stage protein-docking algorithm.
Proteins 52,80–87.
13. Pierce,B., Tong,W., Weng,Z. (2005) M-ZDOCK: a grid-based approach for Cn symmetric
multimer docking. Bioinformatics, 21,1472-1478.
14. Huang,S.Y. & Zou,X. (2008) An iterative knowledge-based scoring function for protein–
protein recognition. Proteins 72,557-579.

CASP related: CAMEO

CAMEO - Continuous Automated Model EvaluatiOn

Juergen Haas, Alessandro Barbato, Dario Behringer, Steven Roth, Martino Bertoni,
Khaled Mostaguir, Andrew Waterhouse, Stefan Bienert, Tobias Schmidt, Torsten Schwede
SIB Swiss Institute of Bioinformatics/Biozentrum, University of Basel, Klingelbergstr 50/70, 4056 Basel

CAMEO - Continuous Automated Model EvaluatiOn (http://www.cameo3d.org) - continuously
assesses the performance of servers predicting protein structures. CAMEO (1) is based on the
weekly pre-release of amino acid sequences and ligand identity for experimental protein
structures which will be part of the upcoming PDB release. Each Saturday, the sequences are
sent to the servers participating in the CAMEO 3D category, and predictions are collected during
the next 4 days. Once the coordinates are released by the PDB on Wednesday, they are used as
the reference for scoring the predictions, and the assessments are then published on cameo3d.org
for in-depth performance analyses.
The assessment of prediction servers based on the PDB prerelease benefits from a large
number of targets: 4,242 targets were evaluated during the last 245 weeks in the "CAMEO 3D"
category, i.e. ~70 targets per month on average. The target set is diverse, consisting of 1,176
hard (lDDT < 50), 2,180 medium and 884 easy targets (lDDT >= 75) (2). 1,867 targets exhibit
homo-oligomeric structure, which makes it possible to assess the ability of servers to correctly
predict the oligomeric assembly state of a target protein.
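The difficulty bins quoted above can be written as a small classifier. The thresholds (below 50 is hard, 75 and above is easy) come from the text; treating the input as a single per-target lDDT value on a 0-100 scale is an assumption, and the function name and sample values are hypothetical.

```python
from collections import Counter

def difficulty(lddt):
    """Bin a target by its lDDT value (0-100 scale assumed)."""
    if lddt < 50:
        return "hard"
    if lddt >= 75:
        return "easy"
    return "medium"

# Toy per-target values, for illustration only.
sample = [32.5, 49.9, 50.0, 61.0, 74.9, 75.0, 88.2]
print(Counter(difficulty(v) for v in sample))
```

The boundary values follow the inequalities as written: a value of exactly 50 falls into the medium bin and a value of exactly 75 into the easy bin.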
For evaluating the performance of Quality Estimation tools assigning local confidence
measures to models, the "CAMEO QE" category relies on 3D coordinates, which are generated
by the participating public 3D modeling servers. These models are then sent to the CAMEO QE
participants for local quality estimation. In this category, 16,119 structural models were assessed
during the last 135 weeks.
The aim of CAMEO is to support various scientific communities by providing a spectrum
of different scores, thereby covering various aspects of modeling. CAMEO supports the
developers of prediction servers, rapidly assessing new developments anonymously and
monitoring the performance of their public productive servers continuously. CAMEO allows life
scientists to understand which public modeling server is the most suited for their specific use
case. And CAMEO stimulates the respective communities in discussing new scores, thereby
covering yet another aspect of the respective field. We hence invite the community to discuss
CAMEO scoring schemes and future developments to best reflect the different requirements and
recent scientific developments.
New developments: CAMEO aims to address the evaluation requirements of the
structural bioinformatics community by introducing new categories and scores as required. The
new category “Contact Prediction” (CP) assesses residue-residue contact predictions,
reflecting recent findings that the quality of a model can be improved greatly by considering
predicted residue-residue contacts in the modeling process (3). This applies in particular for
target proteins larger than 250 amino acid residues, where no structural templates are available.
Other categories that have been requested by the community include Secondary Structure
Predictions and solvent accessibility.
In order to support modelling servers aiming to predict quaternary models including bound
ligands (e.g. co-factors), upcoming releases of CAMEO will include evaluation of ligand
conformation in models (provided as InChIs with the target sequence at submission time), as
well as extend the assessment of quaternary structure from currently homo-oligomers to hetero-
oligomers and complexes. During the last 245 weeks, ligands were contained in 1,758 structures,
and the RCSB PDB released 8,938 hetero-oligomers and complexes, indicating that sufficient
data is available for an automated analysis.
In an effort to make benchmarking processes available to other communities, the workflow of
CAMEO is currently being generalized in the context of the ELIXIR EXCELERATE framework to
apply the concepts of continuous evaluation to other scientific communities.

1. Haas, J. et al. Database 2013, 10.1093/database/bat031.
2. Mariani, V. et al. Bioinformatics. 2013, 29(21):2722-8.
3. Monastyrskyy, B. et al. Proteins 2016, 10.1002/prot.24943.

CASP related: Quaternary structure prediction

Modeling of protein quaternary structure of homo- and hetero-oligomers beyond binary interactions

Martino Bertoni, Florian Kiefer, Marco Biasini, Lorenza Bordoli, Torsten Schwede
Swiss Institute of Bioinformatics/Biozentrum, University of Basel, Klingelbergstr 50/70, 4056 Basel

Cellular processes often depend crucially on interactions between proteins and the formation of
macromolecular complexes. The impairment of such interactions can lead to deregulation of
pathways resulting in disease states, and it is hence critical to gain insights into the nature of the
macromolecular assemblies. Structural knowledge about complexes and protein-protein
interactions is growing, but experimentally determined three-dimensional multimeric assemblies
are outnumbered by complexes supported by non-structural experimental evidence. Here, we aim
to fill this gap by modeling complete multimeric structures by homology, and we ask which
properties of a protein family can assist in the prediction of a correct quaternary structure.
Specifically, we introduce a description of protein-protein interface conservation as a function of
evolutionary distance, which enables us to reduce the noise coming from multiple sequence
alignments where proteins’ oligomeric conformations are neglected. We also define a distance
measure to structurally compare homologous multimeric protein complexes. This allows us to
hierarchically cluster homologs of known structure and quantify the diversity of alternative
biological assemblies available in public databases. We find that the combination of conservation
and structural clustering features, together with classical interface descriptors, is able to improve
the selection of homologous proteins leading to reliable models of protein complexes.

CASP related: SAXS results for the data-assisted category

Small Angle X-ray Scattering provides valuable structural information for Protein
Structure Prediction

S.E. Tsutakawa1, G.L. Hura1, J.A. Tainer1,2


1 – Lawrence Berkeley National Laboratory, 2 – MD Anderson Cancer Center
setsutakawa@lbl.gov

A great challenge in protein structure prediction is distinguishing the accuracy of one model
versus another. Experimental input can facilitate identification of the correct model. Small
Angle X-ray Scattering (SAXS), a solution-based technique to study protein structure, is an ideal
input for prediction algorithms. Like macromolecular crystallography (MX), SAXS relies on
coherent scattering from electron pairs. Every electron (e-) pair distance is represented in the
X-ray scattering intensities, with longer distances represented at low momentum transfer q
(small scattering angles) and shorter distances at higher q. Although SAXS is traditionally
considered low-resolution, protein structure prediction can exploit the accurate structural
information measured in regions of the SAXS data in the higher q range. Our SAXS metrics can
be applied to accurately rank models on similarity to the target structure. SAXS can validate
correct atomic models, eliminate incorrect models, and provide directionality to the trajectories
in structure prediction algorithms.
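The relation between pair distances and scattering intensity can be illustrated with the standard Debye formula, I(q) = Σ_i Σ_j f_i f_j sin(q r_ij) / (q r_ij). This is a textbook sketch, not the SAXS metric used by the authors; it assumes point scatterers with constant weights f_i and ignores solvent and hydration contributions. The function name is hypothetical.

```python
import math

def debye_intensity(coords, q_values, weights=None):
    """Debye sum over all scatterer pairs for each momentum transfer q (1/Angstrom)."""
    n = len(coords)
    w = weights if weights is not None else [1.0] * n
    intensities = []
    for q in q_values:
        total = 0.0
        for i in range(n):
            for j in range(n):
                x = q * math.dist(coords[i], coords[j])
                # sin(x)/x tends to 1 as x goes to 0 (self terms and q = 0).
                total += w[i] * w[j] * (1.0 if x == 0 else math.sin(x) / x)
        intensities.append(total)
    return intensities

# Two point scatterers 10 Angstrom apart.
pair = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
print(debye_intensity(pair, [0.0, math.pi / 10]))
```

For this two-point example the intensity is 4 at q = 0 and drops to about 2 at q = π/10 Å⁻¹, where the cross term sin(q r)/(q r) vanishes; shorter distances would need higher q to produce the same effect.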

Availability
Website: http://sibyls.als.lbl.gov/
