
Biochemistry 1998, 37, 14719-14735


Thermodynamic Parameters for an Expanded Nearest-Neighbor Model for Formation of RNA Duplexes with Watson-Crick Base Pairs
Tianbing Xia, John SantaLucia, Jr., Mark E. Burkard, Ryszard Kierzek, Susan J. Schroeder, Xiaoqi Jiao, Christopher Cox, and Douglas H. Turner*
Department of Chemistry, University of Rochester, Rochester, New York 14627-0216, Department of Chemistry, Wayne State University, Detroit, Michigan 48202, Institute of Bioorganic Chemistry, Polish Academy of Sciences, 60-704 Poznan, Noskowskiego 12/14, Poland, and Department of Biostatistics, School of Medicine, University of Rochester, Rochester, New York 14642
Received April 27, 1998; Revised Manuscript Received August 10, 1998

ABSTRACT: Improved thermodynamic parameters for prediction of RNA duplex formation are derived from optical melting studies of 90 oligoribonucleotide duplexes containing only Watson-Crick base pairs. To test end or base composition effects, new sets of duplexes are included that have identical nearest neighbors, but different base compositions and therefore different ends. Duplexes with terminal GC pairs are more stable than duplexes with the same nearest neighbors but terminal AU pairs. Penalizing terminal AU base pairs by 0.45 kcal/mol relative to terminal GC base pairs significantly improves predictions of ΔG°37 from a nearest-neighbor model. A physical model is suggested in which the differential treatment of AU and GC ends accounts for the dependence of the total number of Watson-Crick hydrogen bonds on the base composition of a duplex. On average, the new parameters predict ΔG°37, ΔH°, ΔS°, and TM within 3.2%, 6.0%, 6.8%, and 1.3 °C, respectively. These predictions are within the limit of the model, based on experimental results for duplexes predicted to have identical thermodynamic parameters.

The thermodynamics of secondary structure formation are important for unraveling structure-function relationships for RNA. For example, these thermodynamics provide a foundation for predicting secondary structure and stability, both of which can correlate with function. Moreover, predicting secondary structure is a crucial intermediate step toward predicting three-dimensional structure (1, 2). In addition, differences between the thermodynamics of secondary structure formation and of overall folding can provide insight into the thermodynamics of tertiary structure formation (3-7). Watson-Crick base pairs are one of the most important motifs in RNA secondary structures. The thermodynamics of Watson-Crick base pair formation have been studied in short RNA duplexes (8, 9). The results are well-represented by a nearest-neighbor model in which the thermodynamic stability of a base pair is dependent on the identity of the adjacent base pairs. This model has been termed an individual nearest-neighbor (INN) model (10, 11). The pioneering implementation by Borer et al. (8) employed 6 nearest-neighbor parameters and separate initiation parameters for duplexes with and without a GC base pair. Due to advances in oligoribonucleotide synthesis (12), Freier et al. (9) were able to determine all 10 nearest-neighbor parameters in the INN model and the initiation parameter for duplexes

This work was supported by NIH Grant GM 22939. * To whom correspondence should be addressed. Author affiliations: Department of Chemistry, University of Rochester; Wayne State University; Polish Academy of Sciences; Department of Biostatistics, University of Rochester.

Biochemistry, Vol. 37, No. 42, 1998. 10.1021/bi9809425 CCC: $15.00 © 1998 American Chemical Society. Published on Web 10/01/1998

with at least one GC base pair. The initiation parameter for duplexes with only AU base pairs was not determined. It has been suggested that a nearest-neighbor model that treats terminal base pairs differently from internal base pairs (8) or treats terminal GC base pairs differently from terminal AU base pairs (10, 11, 13) may improve modeling of duplex stability. The model proposed by Gray (10) has been termed an independent short sequence (ISS) model because the 14 sequence-dependent parameters of the model must be combined into 12 short sequence parameters due to constraints on the number of independent equations available from experimental data (10, 11, 13, 14). This combination of parameters makes it arbitrary to assign values to the individual nearest neighbors other than the two like neighbors, 5′GG3′/3′CC5′ and 5′AA3′/3′UU5′, but this does not hinder predictions of duplex stability. An INN model that is statistically equivalent to the ISS model was recently used to analyze stabilities of 108 DNA duplexes (15, 16). In this model, the initiation term is dependent on the identities of the two terminal base pairs. At 37 °C, the free-energy changes for initiation of DNA duplexes with any combination of GC and AT ends were identical within experimental error, so that a single initiation parameter model fits the duplex stabilities (ΔG°37) as well as the two-initiation parameter model. In an analysis of 45 RNA duplexes, inclusion of an end effect improved predictions of duplex stability by 0.1 kcal/mol or about 1% on average (S. M. Freier and D. H. Turner, unpublished results). This marginal effect was omitted from the final analysis (9). None of the nearest-neighbor models is exact because an average difference of roughly 6% is observed for experimentally determined free-energy changes, ΔG°37, for formation of RNA duplexes with identical nearest neighbors and identical ends (12). Similar results have also been observed with DNA duplexes (17) and with RNA/DNA hybrids (18). The database of RNA duplexes has expanded considerably since 1986, and commercial availability of RNA synthesis reagents now facilitates testing of hypotheses. This paper thus explores several nearest-neighbor model analyses of thermodynamic parameters for a set of 90 RNA duplexes. It is found that an INN model with a 0.45 kcal/mol free-energy penalty for each terminal AU base pair provides an excellent fit to the data. The results are consistent with a physical model in which this term arises from the base composition dependence of the number of hydrogen bonds in the duplex, and the nearest-neighbor parameters reflect both hydrogen bonding and stacking within specific nearest-neighbor pairs.

MATERIALS AND METHODS

Oligonucleotide Synthesis and Purification. RNA oligonucleotides were synthesized on an Applied Biosystems 392 DNA/RNA synthesizer from monomers with 2′-hydroxyls protected as the tert-butyl-dimethylsilyl ether and 5′-hydroxyls protected as the dimethoxytrityl group (19-22). Upon completion of coupling on the synthesizer, the oligonucleotides were removed from solid support and deprotected by treatment with concentrated ammonia in ethanol (3:1 by volume) at 55 °C overnight (23). The 2′-silyl protection was removed by treatment with freshly made 1.0 M triethylammonium fluoride (50 equiv) in pyridine at 55 °C for 48 h. The crude samples were dried and partitioned between water and diethyl ether. Then the samples were desalted through a Sep-pak C-18 cartridge (Waters) and purified on a Si500F thin-layer chromatography plate (Baker) developed with 1-propanol/ammonia/water (55:35:10 by volume) (24). The least mobile product band was visualized with ultraviolet light, cut out, and eluted with water.
The samples were desalted again with a Sep-pak C-18 cartridge. The purity of each oligonucleotide was checked by HPLC on a C-8, C-18, or ABZ analytical reverse-phase column (Hamilton) and was greater than 95%.

UV Melting. Thermodynamic parameters were measured in 1.0 M NaCl, 10 or 20 mM sodium cacodylate, and 0.5 mM Na2EDTA, pH 7.0. Oligoribonucleotide single-strand concentrations, CT, were calculated from high-temperature (>80 °C) absorbances and single-strand extinction coefficients approximated by a nearest-neighbor model as described previously (25, 26). Single strands for forming non-self-complementary duplexes were mixed in 1:1 concentration ratios. Small errors in mixing ratios are not expected to affect the results since the effect of the mixing ratio not being 1:1 was found to be small up to 50% excess of one strand (27). Absorbance versus temperature melting curves were measured at 260 or 280 nm with a heating rate of 1 °C/min from 0 to 90 °C on a Gilford 250 spectrometer controlled by a Gilford 2527 thermoprogrammer. Melting curves were measured over a 60-150-fold range in total oligonucleotide concentration.

Thermodynamic Analysis of Duplex Formation. Thermodynamic parameters for duplex formation were derived by two van't Hoff analysis methods. For the first, the shape of each melting curve was fit to the two-state model with linear

sloping baselines (28) using a nonlinear least-squares program (29, 30):

$$\alpha = \frac{\epsilon_{ss}(T) - \epsilon(T)}{\epsilon_{ss}(T) - \epsilon_{ds}(T)} = \frac{(m_{ss}T + b_{ss}) - \epsilon(T)}{(m_{ss}T + b_{ss}) - (m_{ds}T + b_{ds})} \tag{1}$$

where $\epsilon(T)$ is the measured extinction coefficient at any temperature, $\epsilon_{ss}$ is the average extinction coefficient for the single-stranded state, and $\epsilon_{ds}$ is the extinction coefficient per strand for the double-stranded state. The fraction of total single strand in duplex, $\alpha$, as a function of temperature is related to ΔH° and ΔS° through the equilibrium constant K:

$$K = \exp\left(-\frac{\Delta G^\circ(T)}{RT}\right) = \exp\left(-\frac{\Delta H^\circ}{RT} + \frac{\Delta S^\circ}{R}\right) = \frac{\alpha}{2(C_T/a)(1 - \alpha)^2} \tag{2}$$

where R is the gas constant, 1.987 cal K⁻¹ mol⁻¹, and a is 1 for self-complementary duplexes and 4 for non-self-complementary duplexes (31). CT is the total concentration of single strands. Thermodynamic parameters obtained by fitting melting curves to the parameters $m_{ds}$, $b_{ds}$, $m_{ss}$, $b_{ss}$, ΔH°, and ΔS° were averaged over melting curves measured at different concentrations. The second van't Hoff analysis involved plotting the reciprocal of the melting temperatures (in K), $T_M^{-1}$, versus ln(CT/a) (8):

$$T_M^{-1} = \frac{R}{\Delta H^\circ}\ln(C_T/a) + \frac{\Delta S^\circ}{\Delta H^\circ} \tag{3}$$
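The 1/TM vs ln(CT/a) analysis of eq 3 can be illustrated with a minimal sketch. It assumes an ordinary unweighted least-squares line, and it generates synthetic melting temperatures from assumed values (ΔH° = −60 kcal/mol, ΔS° = −165 eu) rather than from any duplex in the database. The function names `linear_fit`, `vant_hoff_fit`, and `melting_temperature` are hypothetical.

```python
import math

R = 1.987e-3  # gas constant in kcal K^-1 mol^-1 (1.987 cal K^-1 mol^-1)


def linear_fit(x, y):
    """Unweighted least-squares line y = m*x + b; returns (m, b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    m = sxy / sxx
    return m, mean_y - m * mean_x


def melting_temperature(dH, dS, ct, a=1):
    """TM in K from eq 3 rearranged: TM = dH / (dS + R ln(CT/a))."""
    return dH / (dS + R * math.log(ct / a))


def vant_hoff_fit(ct, tm_kelvin, a=1):
    """Fit 1/TM vs ln(CT/a); slope = R/dH, intercept = dS/dH (eq 3).

    Returns (dH, dS, dG37) in kcal/mol, kcal/(mol K), kcal/mol.
    """
    x = [math.log(c / a) for c in ct]
    y = [1.0 / t for t in tm_kelvin]
    m, b = linear_fit(x, y)
    dH = R / m
    dS = b * dH
    return dH, dS, dH - 310.15 * dS


# Synthetic data (assumed values): nine concentrations over a 100-fold range.
ct_values = [1e-6 * 10 ** (k / 4) for k in range(9)]
tm_values = [melting_temperature(-60.0, -0.165, c) for c in ct_values]
dH, dS, dG37 = vant_hoff_fit(ct_values, tm_values)
print(round(dH, 3), round(dS * 1000, 2), round(dG37, 3))
```

Because the synthetic points lie exactly on the line of eq 3, the fit recovers the assumed ΔH° and ΔS°. For a non-self-complementary duplex, `a=4` would be passed instead of the default.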

The data were taken as consistent with the transition proceeding in a two-state manner if the ΔH° values calculated by the two methods agreed within 15% (1, 9, 28, 29). The average of differences in ΔH° between the two analysis methods in our RNA database of 90 two-state sequences is 5.8%. The differences in ΔS° and ΔG°37 are 6.5% and 1.8%, respectively. The thermodynamic parameters obtained from the 1/TM vs ln(CT/a) plots were used in regression analysis for the nearest-neighbor models.

Error Analysis. Instrumental fluctuations contribute negligibly to the uncertainties of thermodynamic parameters. For example, 0.5% fluctuation of absorbance leads to about 0.5% uncertainty in the equilibrium constant and less than 0.005 kcal/mol absolute uncertainty in ΔG°37. Both van't Hoff analysis methods provide means of estimating the sample variances, σ², of thermodynamic parameters and the sample covariances between them. Assuming measured values are distributed normally and variations are larger than the actual temperature dependence of the parameters, sample variances of parameters can be estimated from variation of parameters determined from melting curves measured at different oligonucleotide concentrations. For example, for ΔH°,

$$\sigma_{\Delta H^\circ}^2 = \frac{\sum_{i=1}^{N} \left(\Delta H_i^\circ - \overline{\Delta H^\circ}\right)^2}{N - 1} \tag{4}$$

where N is the number of different concentrations studied, and $\overline{\Delta H^\circ}$ is the average of the measured $\Delta H_i^\circ$ values. Similarly, variances for ΔS° and ΔG°37 can be estimated. This estimation of variances should be considered a rough estimation because the sample size N is small, typically about 8-12, and because there is probably a small temperature dependence to both ΔH° and ΔS°. For 51 duplexes where the required data were available, the average relative sample standard deviations, σ, estimated this way are 6.5%, 7.3%, and 2.4% for ΔH°, ΔS°, and ΔG°37, respectively. Variance estimates can also be obtained from the uncertainties in the slope and intercept of each 1/TM vs ln(CT/a) straight line with all data points weighted equally and with errors in CT negligible compared to error in 1/TM (32-34):

$$\sigma_{\Delta H^\circ}^2 = (\Delta H^\circ)^2 \frac{\sigma_m^2}{m^2} \tag{5}$$

$$\sigma_{\Delta S^\circ}^2 = (\Delta S^\circ)^2 \left[\frac{\sigma_b^2}{b^2} + \frac{\sigma_m^2}{m^2} - \frac{2\,\mathrm{Cov}(m,b)}{mb}\right] \tag{6}$$

$$\sigma_{\Delta G^\circ(T)}^2 = R^2 \left[\sigma_m^2 \left(\frac{Tb - 1}{m^2}\right)^2 + \sigma_b^2 \frac{T^2}{m^2} - 2T\,\mathrm{Cov}(m,b)\,\frac{(Tb - 1)}{m^3}\right] \tag{7}$$

where T is the temperature in Kelvin; m, b, $\sigma_m^2$, $\sigma_b^2$, and Cov(m,b) are the slope, intercept, variance of slope, variance of intercept, and covariance of slope and intercept, respectively. There are standard formulas for calculating these quantities from linear regression (35, 36). For 51 duplexes where the required data were available, the average relative sample standard deviations from this estimate are 2.9%, 3.3%, and 0.96% for ΔH°, ΔS°, and ΔG°37, respectively, which are about half as large as the discrepancies between parameters from the two analysis methods. Note that they are much smaller than estimated from averages of fittings. This is because the standard deviations estimated from the averages of fittings (eq 4) reflect estimates of the variance for the population, that is, how much uncertainty a single measurement is subject to, while eqs 5-7 give the variance of the parameters themselves, that is, how precisely one can determine the average values from certain numbers of repetitions of measurements. Thus, the latter are smaller by a factor of the square root of the number of degrees of freedom (36). Because we usually measure 9 concentrations and derive two parameters from the 1/TM vs ln(CT/a) fit, there are 9 − 2 = 7 degrees of freedom, roughly consistent with the observations on the relative magnitudes of the error estimates from the two methods. For the two error analysis methods, the relative uncertainty in ΔS° is about 13% larger than that in ΔH°. This is because the value of ΔS° depends on more terms than ΔH°. The error in ΔH° is proportional to the uncertainty in the slope of the melting curve or standard deviation of the slope of 1/TM vs ln(CT) plot, while error in ΔS° depends on the uncertainties in the slope and TM of the melting curve, or standard deviations of slope, intercept, and the covariance of the slope and intercept of the 1/TM vs ln(CT) plot.

The sample covariance of ΔH° and ΔS°, Cov(ΔH°, ΔS°), can be calculated from the individual pairs of values of ΔH° and ΔS° from concentration-dependent melting studies as follows (35),

$$\mathrm{Cov}(\Delta H^\circ, \Delta S^\circ) = \frac{1}{N - 1} \sum_i \left[\left(\Delta H_i^\circ - \overline{\Delta H^\circ}\right)\left(\Delta S_i^\circ - \overline{\Delta S^\circ}\right)\right] \tag{8}$$

The observed correlation coefficients between ΔH° and ΔS°, r, calculated by

$$r = \frac{\mathrm{Cov}(\Delta H^\circ, \Delta S^\circ)}{\sigma_{\Delta H^\circ}\,\sigma_{\Delta S^\circ}} \tag{9}$$

or (37)

$$r = \frac{\sum_i (1/T_{M,i})}{\left[N \sum_i (1/T_{M,i})^2\right]^{1/2}} \tag{10}$$

have an average value of 0.9996 for duplexes in the RNA database, indicating compensating effects between errors in ΔH° and ΔS°. This is probably because the range of TM is relatively narrow, that is, the chemical variations between measurements (e.g., changes in total strand concentration) and therefore the changes of parameters with temperature are small compared to the statistical uncertainties in ΔH° and ΔS°; therefore correlation coefficients approach unity (37). Due to this fact, ΔG° and TM are more accurate parameters than either ΔH° or ΔS° individually (15, 29, 33, 34). For free-energy changes, ΔG°(T) = ΔH° − TΔS°, the uncertainties in ΔH° and ΔS° are propagated into uncertainties in ΔG°(T) by the following equation (38):

$$\sigma_{\Delta G^\circ(T)}^2 = \sigma_{\Delta H^\circ}^2 + T^2\sigma_{\Delta S^\circ}^2 - 2T\,\mathrm{Cov}(\Delta H^\circ, \Delta S^\circ) = \sigma_{\Delta H^\circ}^2 + T^2\sigma_{\Delta S^\circ}^2 - 2Tr\,\sigma_{\Delta H^\circ}\sigma_{\Delta S^\circ} \tag{11}$$
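The compensation expressed by eq 11 can be made concrete with a short numerical sketch. The inputs are illustrative assumptions (standard deviations of 6.5 kcal/mol in both ΔH° and TΔS° at 310.15 K, and r = 0.999), and the function names `sigma_dG` and `most_precise_temperature` are hypothetical.

```python
import math


def sigma_dG(sigma_dH, sigma_dS, r, T):
    """Standard deviation of dG(T) from eq 11, with Cov = r * sigma_dH * sigma_dS."""
    variance = sigma_dH ** 2 + (T * sigma_dS) ** 2 - 2 * T * r * sigma_dH * sigma_dS
    return math.sqrt(max(variance, 0.0))  # guard against tiny negative rounding


def most_precise_temperature(sigma_dH, sigma_dS, r):
    """Temperature that minimizes eq 11: d(variance)/dT = 0 gives r*sigma_dH/sigma_dS."""
    return r * sigma_dH / sigma_dS


T37 = 310.15
s_dH = 6.5           # kcal/mol, assumed for illustration
s_dS = 6.5 / T37     # kcal/(mol K), chosen so that T*sigma_dS = 6.5 kcal/mol

correlated = sigma_dG(s_dH, s_dS, 0.999, T37)
uncorrelated = sigma_dG(s_dH, s_dS, 0.0, T37)
print(round(correlated, 2), round(uncorrelated, 2))
print(round(most_precise_temperature(s_dH, s_dS, 0.999), 2))
```

Under these assumed inputs the correlated case gives about 0.29 kcal/mol, against about 9.2 kcal/mol if the errors were uncorrelated, which shows why a correlation coefficient near unity makes ΔG° far more precise than ΔH° or ΔS° alone.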

If r is unity, $\sigma_{\Delta G^\circ(T)}^2 = (\sigma_{\Delta H^\circ} - T\sigma_{\Delta S^\circ})^2$, which states that the error in ΔG°(T) is the balance between the errors in ΔH° and TΔS° just like ΔG°(T) is the balance between ΔH° and TΔS°. The temperature at which ΔG° values are most accurate can be calculated by setting the first derivative of $\sigma_{\Delta G^\circ(T)}^2$ with respect to temperature, T, to 0, which gives $T_{min} = r(\sigma_{\Delta H^\circ}/\sigma_{\Delta S^\circ})$. At this temperature, $\sigma_{\Delta G^\circ(T)}^2 = (1 - r^2)\sigma_{\Delta H^\circ}^2$. This temperature is close to TM for many of the sequences in the RNA database. This is because the relative error in ΔS° is usually about 13% larger than the relative error in ΔH°, so for the range of CT commonly employed, 10⁻⁵ to 10⁻⁴ M, this ratio $r(\sigma_{\Delta H^\circ}/\sigma_{\Delta S^\circ}) \approx r\Delta H^\circ/1.13\Delta S^\circ$ is close to TM for the duplex since TM = ΔH°/[ΔS° + R ln(CT/a)]. Because human body temperature is the most common temperature for applications of ΔG° values, many sequences were designed to have melting temperatures near 37 °C at a CT of about 10⁻⁴ M. On the basis of error propagation for eq 3, the uncertainty in TM can be estimated by the following:

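$$\sigma_{T_M}^2 = \frac{T_M^2}{(\Delta H^\circ)^2} \left[\sigma_{\Delta H^\circ}^2 + T_M^2\sigma_{\Delta S^\circ}^2 + R^2T_M^2\frac{1}{(C_T)^2}\sigma_{C_T}^2 - 2T_M\,\mathrm{Cov}(\Delta H^\circ, \Delta S^\circ)\right] \tag{12}$$

where it is assumed that neither ΔH° nor ΔS° is correlated with CT. If CT is determined accurately by independent measurement, the term containing 1/(CT)² can be dropped.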
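As a rough numerical illustration of eq 12, the sketch below evaluates the uncertainty in TM with and without the concentration term. All inputs are assumptions chosen for illustration (TM = 313 K, ΔH° = −60 kcal/mol, standard deviations of 6.5 kcal/mol in ΔH° and TMΔS°, r = 0.999, and a 10% relative error in CT); the function name `sigma_tm` is hypothetical.

```python
import math

R = 1.987e-3  # gas constant in kcal K^-1 mol^-1


def sigma_tm(tm, dH, sigma_dH, sigma_dS, cov_HS, rel_sigma_ct=0.0):
    """Standard deviation of TM (K) from eq 12.

    rel_sigma_ct is sigma_CT / CT; leaving it at 0 drops the concentration term.
    """
    bracket = (
        sigma_dH ** 2
        + (tm * sigma_dS) ** 2
        + (R * tm * rel_sigma_ct) ** 2
        - 2 * tm * cov_HS
    )
    return abs(tm / dH) * math.sqrt(max(bracket, 0.0))


tm = 313.0                     # K, assumed
dH = -60.0                     # kcal/mol, assumed
s_dH = 6.5                     # kcal/mol, assumed
s_dS = 6.5 / tm                # kcal/(mol K), so that TM*sigma_dS = 6.5 kcal/mol
cov = 0.999 * s_dH * s_dS      # Cov = r * sigma_dH * sigma_dS with r = 0.999

without_ct = sigma_tm(tm, dH, s_dH, s_dS, cov)
with_ct = sigma_tm(tm, dH, s_dH, s_dS, cov, rel_sigma_ct=0.10)
print(round(without_ct, 2), round(with_ct, 2))
```

With these assumed inputs the bracketed sum is small because the covariance term nearly cancels the two variance terms, so the resulting uncertainty in TM is on the order of 1.5 K even though the individual uncertainties in ΔH° and TMΔS° are several kcal/mol.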

Then the three remaining terms in brackets are reduced to $\sigma_{\Delta G^\circ(T_M)}^2$; therefore,

$$\sigma_{T_M}^2 = \frac{T_M^2}{(\Delta H^\circ)^2}\,\sigma_{\Delta G^\circ(T_M)}^2 \tag{13}$$

Due to the correlation between ΔH° and ΔS°, $\sigma_{\Delta G^\circ(T_M)}$ is small compared to ΔH°; therefore, the relative uncertainty in TM is usually small. For a typical RNA oligonucleotide duplex, ΔH° is about −60 kcal/mol, so an uncertainty of 0.3 kcal/mol in ΔG° results in about 1.6 °C uncertainty for a TM of 313 K = 40 °C. The CT calculated from high-temperature absorption and extinction coefficients approximated by a nearest-neighbor model may have significant error, and this can lead to an incorrect mixing ratio of two single strands forming a non-self-complementary duplex. A 10%-20% error in CT contributes about 0.06-0.1 kcal/mol to $\sigma_{\Delta G^\circ(T_M)}$ and gives an additional 0.3-0.5 °C uncertainty in TM. These uncertainties in the thermodynamic parameters reflect the precision of measurements or parameters (35) and can be treated statistically. The accuracy of the measurement, however, depends on the control and correction of systematic errors, which are difficult to treat statistically (39, 40). Because the true value of a physical quantity is never known, the inaccuracy of a measurement can only be estimated from experience. The ideal way of estimating systematic errors is to compare values for the same quantity measured by totally different techniques. This kind of comparison, however, is rarely available. One comparison that reflects accuracy can be made between UV melting from our lab (41) and substrate inhibition measurements in a group I ribozyme reaction (42). For 4 separate sequences, these completely different techniques give average discrepancies of 8% in ΔG°50. UV melting performed on different commercial spectrometers can give an indication of systematic errors from certain sources, for example, faulty calibration and incorrect data analysis. Three DNA sequences have been independently measured by two labs and the discrepancies are 3%, 6%, 6%, and 1 °C in ΔG°37, ΔH°, ΔS°, and TM, respectively (39), similar to observed precision of UV melting experiments.

For the 90 sequences in our RNA database, we double the average of differences between the thermodynamic parameters obtained by the two van't Hoff analysis methods to estimate the true error limits. Specifically, we use 4% relative error for ΔG°37, except for 6 sequences that actually have differences higher than 4% (Table 1). For these 6 sequences, we double their actual differences. Similarly, we used 12% and 13.5% relative errors for ΔH° and ΔS°, respectively, consistent with the relative error for ΔS° being about 13% higher than that of ΔH°. While the standard deviations in ΔH° and TΔS° both average about 6-7 kcal/mol at 37 °C, the compensation between errors of ΔH° and ΔS° (correlation coefficient above 0.999) results in final errors in ΔG° that average only about 0.3 kcal/mol. This is consistent with the fact that the change in UV absorbance with temperature can be directly related to the equilibrium constant, and therefore ΔG°. We assume that the systematic error is within the estimate of experimental error. The estimates of standard deviations are propagated into the uncertainties of nearest-neighbor parameters in a linearly proportional way. The uncertainties in nearest-neighbor parameters are inversely proportional to $[(N - \nu)/\nu]^{1/2}$, where N is the total number of different sequences, and ν is the number of parameters in a model (36).

Basic Assumptions of Nearest-Neighbor Model. The thermodynamic properties for forming an RNA duplex from random coil can be approximated with a nearest-neighbor model (8, 43). In this model, formation of an RNA duplex includes the concentration-dependent initiation, that is, formation of the first base pair which involves hydrogen bonding and brings two strands together, and the propagation of the helix by base pairing, which includes stacking interactions as well as hydrogen bonding (8). The nearest-neighbor model assumes that the contribution of a base pair to the thermodynamic property depends only on the identity of the two adjacent neighbors and that RNA thermodynamics has a linear dependence on the frequencies of base pair doublets. This assumes that the contribution of a given nearest neighbor to the measured property is identical for all occurrences of this nearest neighbor in the sequences under consideration (14, 44, 45). For example, for the INN-HB model presented in this paper, which also includes a term for terminal AU pairs and therefore for the base composition of the sequence, the free-energy change of duplex formation can be written as the following:

$$\Delta G^\circ(\mathrm{duplex}) = \Delta G^\circ_{init} + \sum_j n_j\,\Delta G^\circ_j(\mathrm{NN}) + m_{term\text{-}AU}\,\Delta G^\circ_{term\text{-}AU} + \Delta G^\circ_{sym} \tag{14}$$
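The sum in eq 14 can be sketched in a few lines. The terminal AU penalty (0.45 kcal/mol) and the symmetry term (0.43 kcal/mol) are the values stated in this paper's text; the ten nearest-neighbor values and the initiation term are assumptions here, recalled from the published INN-HB parameter table, which is not part of this section. The names `NN_DG37`, `predict_dg37`, and `reverse_complement` are hypothetical.

```python
# Nearest-neighbor free energies at 37 C (kcal/mol), keyed by the 5'->3'
# dinucleotide read on one strand. ASSUMPTION: recalled from the published
# INN-HB parameter table, not given in this section.
NN_DG37 = {
    "AA": -0.93, "AU": -1.10, "UA": -1.33, "CU": -2.08, "CA": -2.11,
    "GU": -2.24, "GA": -2.35, "CG": -2.36, "GG": -3.26, "GC": -3.42,
}
DG_INIT = 4.09     # ASSUMPTION: initiation term, same provenance as NN_DG37
DG_TERM_AU = 0.45  # penalty per terminal AU pair (stated in the text)
DG_SYM = 0.43      # symmetry term for self-complementary duplexes (stated in the text)

COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(seq):
    """Watson-Crick partner strand, written 5'->3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def predict_dg37(seq):
    """Free energy of duplex formation at 37 C for one strand of a fully paired duplex."""
    total = DG_INIT
    for i in range(len(seq) - 1):
        step = seq[i:i + 2]
        # The same stack read along the partner strand is its reverse complement.
        key = step if step in NN_DG37 else reverse_complement(step)
        total += NN_DG37[key]
    # Count terminal pairs that are AU (the listed strand ends in A or U).
    total += DG_TERM_AU * sum(base in "AU" for base in (seq[0], seq[-1]))
    if seq == reverse_complement(seq):
        total += DG_SYM
    return total


for seq in ("CCGG", "AGCGCU", "GCACG"):
    print(seq, round(predict_dg37(seq), 2))
```

Under these assumed parameters the sketch gives −4.36, −7.94, and −6.04 kcal/mol, which agree with the predicted −ΔG°37 entries listed in Table 1 for CCGG, AGCGCU, and GCACG/.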


Each $\Delta G^\circ_j(\mathrm{NN})$ term is the free-energy contribution of the jth nearest neighbor with $n_j$ occurrences in the sequence. The $m_{term\text{-}AU}$ and $\Delta G^\circ_{term\text{-}AU}$ terms are the number of terminal AU pairs and the associated free-energy parameter, respectively. The $\Delta G^\circ_{init}$ term is the free energy of initiation. Among other factors, initiation includes translational and rotational entropy loss for converting two particles into one. Theoretically there is a logarithmic dependence of this contribution on the mass of the two particles (46-48). Therefore, in principle, the entropic contribution to initiation can be length- or even sequence-dependent for an RNA duplex. In the nearest-neighbor models we analyzed, the free energy of initiation is assumed to be independent of length. This is a reasonable assumption because the range of sequence lengths in the RNA database is not sufficient to distinguish length-dependent initiation (results not shown). Therefore $\Delta G^\circ_{init}$ is assumed to be a constant value for each duplex. The $\Delta G^\circ_{sym}$ term in eq 14 is due to the fact that there is 2-fold rotational symmetry in self-complementary duplexes, but not in single-stranded states or non-self-complementary duplexes. The maintenance of this symmetry makes self-complementary duplexes less favorable by a factor of 2 in conformational space (31, 49), that is, there is an extra entropy loss of R ln 2 = 1.4 eu upon duplex formation. Because there is also a factor of 4 in the equilibrium constant for a non-self-complementary duplex (eq 2), a fair comparison of TM requires non-self-complementary duplexes at twice the strand concentration of self-complementary duplexes (31). At 37 °C, $\Delta G^\circ_{sym}$ is +0.43 kcal/mol for self-complementary duplexes and 0 for non-self-complementary duplexes (9). It is possible that self-complementary duplexes


Table 1: RNA Thermodynamic Database: Experimental and Predicted (by INN-HB Model) Thermodynamic Parameters for Duplex Formation in 1 M NaCl, pH 7 experimentalb no 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 sequencesa CCGG CGCG GCGC GGCC ACGCA/ AGCGA/ CACAG/ GCACG/ GCUCG/ ACCGGUp AGCGCU AGGCCU CACGUG CAGCUGp CCAUGG CCGCGG CCUAGG CGCGCGp CGGCCGp CUGCAGp GACGUC GAGAGA/ GAGCUC GAGGAG/ GCAACG/ GCAUCG/ GCAUGC GCCGCG/ GCCGGCp GCGCCG/ GCGCGCp GCGCGG/ GCGGCG/ GCGUCG/ GCUACG/ GCUAGC GGAUCC GGCGCCp GGCGCG/ GGUACC GUCGAC GUGCAC GUGGUG/ GUGUCG/ UCAUGA UCCGGAp UCGCGA UCUAGA UGAUCA UGCGCA AAGGAGG/ ACUGUCA/ AGUCUGA/ GACUCAG/ GAGUGAG/ GUCACUG/ AACUAGUU AAUGCAUUp ACCUUUGC/ ACUAUAGU ACUUAAGU AGAGAGAG/ AGAUAUCU AGUUAACU AUACGUAU AUCUAGAU AUGCGCAUp AUGUACAUp CAAAAAAG/ -G37 (kcal/mol) 4.55 3.66 4.61 5.37 4.97 5.05 4.70 6.17 6.14 8.51 7.99 8.36 6.59 6.68 7.30 9.84 7.80 9.12 9.90 7.11 7.35 6.95 7.98 8.50 7.01 7.26 7.38 10.88 11.20 10.91 10.62 11.38 10.40 8.76 7.56 7.92 7.44 11.33 10.78 7.35 7.09 7.65 7.67 7.18 4.31 7.79 6.85 4.95 5.05 8.22 9.54 7.92 7.52 9.05 9.71 8.62 7.16 7.18 10.64 6.98 6.16 11.12 6.58 6.36 6.53 7.20 10.17 6.49 4.61 -H (kcal/mol) -S (eu) TMd (C) -G37 (kcal/mol) predictedc -H (kcal/mol) 33.81 32.55 36.79 38.05 36.31 37.39 39.15 43.75 44.83 49.17 50.31 51.57 50.71 53.11 53.43 59.33 51.82 58.07 59.33 53.11 54.71 50.95 57.11 55.62 50.57 54.17 56.41 60.82 63.57 60.82 62.31 60.82 60.82 56.39 51.48 54.80 57.43 63.57 60.82 53.66 54.71 54.95 53.46 52.71 44.09 51.25 49.99 42.48 44.09 50.23 59.67 55.55 56.63 64.07 64.07 62.99 54.04 57.11 66.90 57.47 54.04 71.91 61.24 54.04 56.53 61.24 68.99 59.08 51.41 -S (eu) 95.0 93.2 103.4 105.2 100.5 103.7 111.9 121.5 124.7 133.0 136.6 138.4 142.4 147.8 148.8 158.6 143.0 156.8 158.6 147.8 153.6 142.7 159.0 153.4 140.5 151.2 157.2 161.4 168.8 161.4 167.0 161.4 161.4 153.8 142.2 151.4 160.0 
168.8 161.4 147.8 153.6 152.6 147.0 146.6 127.4 139.0 137.2 121.6 127.4 136.2 162.1 152.9 156.1 177.1 177.1 173.9 153.6 164.0 182.1 162.8 153.6 196.9 175.0 153.6 162.0 175.0 189.6 168.6 150.5 TMd (C) 25.3 18.8 29.1 34.9 29.0 29.9 24.1 36.7 37.4 51.8 51.6 55.9 42.4 46.6 46.6 62.2 48.1 58.5 62.2 46.6 45.1 40.6 49.0 48.2 42.6 43.9 48.3 62.7 66.6 62.7 63.1 62.7 62.7 51.9 44.9 49.8 48.9 66.6 62.7 49.9 45.1 48.4 47.6 43.8 29.5 52.7 48.3 30.5 29.5 52.0 55.1 48.7 49.0 52.4 52.4 52.2 41.2 40.1 58.4 44.2 41.2 58.9 43.7 41.2 40.4 43.7 58.7 43.0 28.9 refs 29 this work 98 56 this work this work 99 this work this work 29 73 57 this work 9 84 f 100 73 98 9 12 9 80, 105 34 this work this work 9 this work 98 this work 98 this work 101 this work this work 102 100 73 this work 80 12 9 this work this work this work 57 f this work 12 73 g this work this work this work this work this work 12 87 this work 12 12 9 12 12 102 12 102 102 103

Two-State Sequences Used in Regression Analysise 34.21 95.6 27.2 4.36 33.31 95.6 19.3 3.62 30.48 83.4 26.6 4.68 35.79 98.1 34.3 5.42 45.40 130.4 29.4 5.14 46.31 133.0 30.2 5.22 40.20 115.4 24.5 4.45 45.31 126.2 37.5 6.04 43.38 120.1 37.2 6.12 59.78 164.5 53.9 7.94 50.07 135.7 52.0 7.94 48.19 128.4 55.3 8.68 50.31 141.0 42.8 6.54 51.55 144.7 43.1 7.28 56.93 159.9 46.4 7.32 60.79 164.3 59.8 10.14 54.10 149.1 50.0 7.49 54.51 146.4 57.8 9.40 54.12 142.6 63.2 10.14 55.41 155.7 45.3 7.28 58.06 163.5 46.2 7.02 62.05 178.1 40.6 6.67 62.30 175.3 48.7 7.76 55.70 152.2 50.9 8.03 50.57 140.5 42.6 6.97 51.89 143.9 44.1 7.25 62.34 177.2 45.7 7.64 59.69 157.4 63.9 10.73 62.72 166.0 67.2 11.20 57.85 151.3 65.2 10.73 65.98 178.5 62.1 10.46 71.16 192.7 61.9 10.73 58.50 155.0 61.8 10.73 52.38 140.6 53.7 8.64 58.02 162.7 45.0 7.34 59.13 165.1 49.3 7.81 53.70 149.1 47.6 7.80 67.78 182.0 65.2 11.20 63.53 170.1 61.6 10.73 54.90 153.4 46.6 7.81 53.63 150.1 45.3 7.02 59.61 167.5 47.7 7.60 48.84 132.7 47.4 7.87 50.88 140.9 43.7 7.21 41.89 121.2 27.2 4.60 51.92 142.3 50.1 8.16 48.94 135.7 44.6 7.42 36.53 101.8 31.0 4.77 44.73 128.0 32.6 4.60 51.54 139.7 53.1 8.00 58.72 158.6 56.2 9.42 52.24 142.9 48.2 8.14 51.48 141.8 45.7 8.22 64.11 177.5 52.0 9.12 70.49 196.0 53.7 9.12 57.81 158.6 51.1 9.04 54.62 153.0 45.7 6.41 59.81 169.7 45.0 6.28 77.42 215.3 56.3 10.43 59.21 168.4 44.0 6.98 47.23 132.4 40.3 6.41 73.66 201.7 59.6 10.83 64.51 186.8 41.4 6.97 52.42 148.5 41.1 6.41 54.36 154.2 42.0 6.28 59.89 169.9 45.1 6.97 64.39 174.8 60.3 10.20 55.91 159.3 41.7 6.81 53.83 158.7 28.6 4.75



Table 1 (Continued) experimentalb no 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 1 2 3 4 5 6 7 8 9 1 2 3 4 5 6 7 8 9 10 sequencesa CAUGCAUGp CGACGCAG/ CUCGCACA/ GAACGUUC GAUAUAUC GAUGCAUCp GGCUUCAA/ GUAUAUAC GUCUAGAC GUUCGAAC UAGAUCUA UAUGCAUAp UCCUUGCA/ UCUAUAGA UGACCUCA/ UUCCGGAA UUGCGCAA UUGGCCAA UUGUACAA CAAAAAAAG/ AAGGUUGGAA/ CAUGCG/ GAGCUG/ GCUGAG/ GUGCAG/ UAAGGUA/ GAGAUCUC GCCAUGGC GCUGCGAC/ UCCGCGCA/ CACUG/ AGAGAG/ AUGCAUp CGUACG UGGCCAp GCAACGA/ AGUAUACU GAGAGAGA/ GUGAUCAC A7U7 -G37 (kcal/mol) 9.67 12.32 12.11 9.30 6.09 10.12 10.20 5.94 10.11 8.76 7.25 7.27 11.09 6.96 12.34 10.79 10.18 11.00 6.70 5.47 12.69 7.00 7.49 7.72 7.67 6.95 10.11 15.06 13.93 14.59 3.34 6.81 4.73 5.35 8.99 9.20 6.80 11.80 9.49 6.69 -H (kcal/mol) 73.67 70.45 72.61 77.00 62.04 72.75 61.59 63.35 76.02 74.19 60.15 67.73 70.27 62.06 76.09 67.43 62.17 63.68 49.45 59.78 75.84 -S (eu) 206.3 187.4 195.1 218.3 180.4 201.9 165.7 185.1 212.5 211.0 170.6 195.0 190.8 177.7 205.6 182.6 167.6 169.8 137.8 175.1 203.6 TM (C) 54.9 67.1 64.9 52.3 39.1 57.2 59.1 38.3 56.2 50.4 45.3 44.4 60.7 43.5 64.6 62.4 61.2 65.3 43.6 33.8 66.5
d


predictedc -G37 (kcal/mol) 9.54 12.83 12.13 8.88 6.14 10.02 10.54 6.15 10.15 8.88 7.20 7.08 11.27 7.20 11.51 10.02 9.86 10.60 6.47 5.68 13.10 -H (kcal/mol) 71.79 77.31 73.39 68.35 64.79 75.79 67.94 61.02 72.72 68.35 59.55 58.85 67.84 59.55 69.98 64.89 63.87 65.13 53.96 58.23 80.95 52.17 55.11 55.11 54.03 46.42 76.49 83.19 81.55 76.26 39.15 48.99 43.47 48.16 51.49 59.29 57.47 73.87 74.33 80.17 -S (eu) 200.8 207.8 197.5 191.6 189.0 212.0 185.1 176.8 201.6 191.6 168.8 167.0 182.5 168.8 188.6 177.0 174.2 176.0 153.2 169.5 218.9 145.6 152.0 152.0 148.8 129.8 213.8 222.6 218.0 199.8 111.9 137.3 126.0 135.8 138.0 162.5 162.8 202.3 207.4 236.6 T Md (C) 54.5 66.7 64.8 52.5 39.4 55.9 58.6 39.6 57.5 52.5 45.1 44.4 62.4 45.1 62.8 59.1 58.6 62.1 41.5 34.7 66.2 42.5 47.9 47.9 47.5 37.4 56.4 72.2 70.0 74.3 24.4 38.9 28.1 39.4 56.3 52.3 44.2 59.6 56.2 41.4 refs 87 this work this work 12 9 87 this work 9 9 12 12 87 this work 12 27 this work this work this work this work 103 this work this work this work this work this work g this work h this work this work 104 i 102 100 57 101 12 i j 58

Two-State Sequences Not Used in Regression Analysise 48.57 134.0 42.9 7.01 51.59 142.2 45.5 7.95 55.85 155.2 46.2 7.95 55.94 155.6 46.0 7.87 51.27 142.9 42.2 6.18 75.00 209.2 56.5 10.14 93.91 254.2 71.4 14.16 86.18 233.0 67.9 13.89 81.15 214.6 73.2 14.29 38.82 58.19 41.70 46.60 59.90 78.80 53.10 91.19 71.79 75.70 Non-Two-State Sequencese 114.4 16.4 4.45 165.7 40.7 6.40 119.2 30.1 4.42 133.1 34.6 6.01 164.1 55.3 8.74 224.0 50.2 8.87 149.1 44.1 6.98 256.0 57.6 11.10 200.9 54.4 9.98 222.5 41.0 6.84

a Listed in order of length and then alphabetically for sequences of the same length. Only one strand of each duplex is given, with a slash at the end indicating non-self-complementary duplexes. For non-self-complementary duplexes, the strand that begins with a 5′ purine is chosen over pyrimidine if possible. Experimental errors are estimated as 4% for ΔG°37, except for CGCG (10%), ACCGGU (11%), GCGCCG/ (12%), CGACGCAG/ (10%), GGCUUCAA/ (12%), and CUCGCACA (10%); experimental errors for ΔH° and ΔS° are estimated as 12% and 13.5%.
b Values are from 1/TM vs ln(CT/a) analysis.
c Predicted by the INN-HB model.
d Calculated for 0.1 mM strands for self-complementary duplexes and 0.2 mM strands for non-self-complementary duplexes (see Materials and Methods). Some experimental TMs are calculated with additional digits in ΔH° and ΔS°.
e Criterion for two-state transition is agreement within 15% for ΔH° values derived from 1/TM vs ln(CT/a) plots and from the average of the fitting of melting curves.
f D. Leffel and D. H. Turner, unpublished results.
g T. W. Barnes and D. H. Turner, unpublished results.
h X. Chen and D. H. Turner, unpublished results.
i J. A. Jaeger and D. H. Turner, unpublished results.
j S. M. Freier and D. H. Turner, unpublished results.

might form with structures that do not have 2-fold rotational symmetry. Thus a floating symmetry correction term was tested. The regression analysis gave a ΔG°sym of +0.42 kcal/mol, very close to the theoretical value. Therefore this symmetry correction is fixed at +0.43 kcal/mol for ΔG°sym and -1.4 eu for ΔS°sym in the final regression analysis. Equations similar to eq 14 can be written for ΔH° and ΔS°, but there is no symmetry correction term in the equation for ΔH°. Knowledge of each individual term in eq 14 allows calculation of the free-energy change of duplex formation for RNA of any sequence within the framework of a nearest-neighbor model. The thermodynamic parameters of base pair doublets and initiation were found by fitting the nearest-neighbor model to experimental data.

Error-Weighted Multiple Linear Regression Analysis. The regression function (50, 51) for the thermodynamics of Watson-Crick paired duplex formation is the following:

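$$\Delta G^\circ_i(\text{duplex}) - \Delta G^\circ_{\text{sym}} = \Delta G^\circ_i(\text{total}) = \Delta G^\circ_{\text{init}} + \sum_j n_{ij}\,\Delta G^\circ_j(\text{NN}) + m_i\,\Delta G^\circ_{\text{term-AU}} \qquad (i = 1, 2, \ldots, N) \tag{15}$$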

where each different oligonucleotide duplex is labeled by index i. Because ΔG°sym is a known fixed value that depends only on the symmetry property of the sequence, it is subtracted from the experimental ΔG°i(duplex) to form ΔG°i(total). Because ΔG°init has the same value for each sequence, it forms the intercept in the linear regression analysis. Various combinations of nearest-neighbor parameters can be used and the terminal AU term can be omitted, so the number of parameters depends on the specific model.

Extraction of the thermodynamic parameters by singular-value decomposition (52) is described in Supporting Information (see paragraph at the end of paper regarding Supporting Information). Matrix calculations were performed using the program MATHEMATICA version 3.0 (53) as described previously (54). Independent calculations were also performed using SAS statistical software (55). Essentially identical results were obtained by the two different methods. To verify the calculation methods, we derived the nearest-neighbor parameters for the RNA database of Freier et al. (9) and reproduced the literature values. When separate calculations were performed for ΔG°37, ΔH°, and ΔS°, they agreed with the standard thermodynamic equation, ΔG°37 = ΔH° - 310.15ΔS°; therefore the reported ΔS° parameters are derived from this equation.

Statistical Tests. In general, we use the Principle of Maximum Likelihood (36) to obtain the best solution in a least-squares problem. It is assumed that our data are normally distributed. Each regression analysis included an examination of residuals as a check on the required assumption of normally distributed errors. If the residuals follow a normal distribution, the error-weighted χ² will follow the χ² distribution with f degrees of freedom, where f is N minus the number of fitted parameters. The probability density function (pdf) of the χ² distribution is given by (36)

$$CS(\chi^2; f) = \frac{(\chi^2)^{(f-2)/2}}{\Gamma(f/2)\,2^{f/2}}\,\exp(-\chi^2/2) \tag{16}$$

where Γ(f/2) is the gamma function. One can calculate the upper-tail area function by integrating CS(χ²; f) from a specific value of χ² to infinity. This integral, conventionally denoted by Q(χ²; f), is the incomplete gamma function (36):

$$Q(\chi^2; f) = \frac{\Gamma(f/2,\ \chi^2/2)}{\Gamma(f/2)} \tag{17}$$

Q is the probability of getting a value of χ² larger than a specific value (e.g., a specific observed value) by random chance alone. This probability gives a quantitative measure for the goodness-of-fit of the model. A higher value of Q indicates a better fit. Because the expectation value for χ² is the number of degrees of freedom, the expectation value for Q from eq 17 is approximately 0.5 for large sample size. A very small Q value indicates that the apparent discrepancies between the data set and the model are unlikely to be chance fluctuations. A small Q can be due to either an incorrect model or an underestimate of uncertainties in the measurements. We can reject a model on the grounds that it is too improbable that chance fluctuations yield a value of χ² as large as that observed. We can never prove that a given model is physically correct, however, because there is always the possibility that the model closely approximates reality, and that our data are not sufficiently sensitive to detect the difference (35). Because it is always possible that the residuals may not be normally distributed, the acceptable level of Q is often set at a low level, Q ≥ 0.001-0.05 (Press et al., 1992). Typically, erroneous models can be identified unambiguously because they give very small values of Q, often less than 10⁻¹⁰. Although for most purposes the Q test is an adequate measure of the goodness-of-fit, it should be realized that χ² is a function of the quality of the data as well as the choice of the model (35).

When a model is extended, the F test can be used to test whether there is a significant difference between fits of the two related models to a sample by comparing the observed reduced chi-squares, χ_f² = χ²/f, for the two models. An F value, formed by the ratio of two reduced χ² values, F = χ_f1²/χ_f2², is distributed according to the F distribution (35, 36):

$$f(F; f_1, f_2) = \frac{\Gamma[(f_1 + f_2)/2]}{\Gamma(f_1/2)\,\Gamma(f_2/2)} \left(\frac{f_1}{f_2}\right)^{f_1/2} \frac{F^{(f_1 - 2)/2}}{(1 + F f_1/f_2)^{(f_1 + f_2)/2}} \tag{18}$$

The smaller the F probabilities, the more significant the difference between the two model fits. A particular use of the F distribution is to determine whether an expansion in the number of parameters is statistically justified. The F value between two models is given by (Δχ²/Δf)/χ_f2², where Δχ² is the difference between two χ² values, Δf is the difference in degrees of freedom, and χ_f2² is for the model with the additional parameters. This ratio is a measure of how much the additional terms have improved the value of the χ_f² and should be large if the new parameters are worth adding.

Besides testing the whole model, we can also test the significance of each individual parameter in the model. The Student's t test is appropriate for testing whether a mean of a sample is significantly different from the mean of the population. For our purpose, we test whether a parameter is significantly different from 0. The pdf for the t distribution for f degrees of freedom is (36) the following:

$$St(t; f) = \frac{[(f - 1)/2]!}{[(f - 2)/2]!\,\sqrt{\pi f}}\,(1 + t^2/f)^{-(f + 1)/2} \tag{19}$$

where the test statistic t is the quotient of the estimate of the parameter and the estimate of the standard error associated with it, for example, t = ΔG°j(NN)/σj(NN). This quotient follows a t distribution for f degrees of freedom, where f is N minus the number of fitted parameters (36). The smaller the t probability, the more significant is the parameter. The t test is useful for cases where sample size is small (typically less than 30) and the population variance can only be estimated.

RESULTS

RNA Database. Table 1 lists thermodynamic parameters of duplex formation derived from 1/TM vs log(CT/a) plots using the analysis described in Materials and Methods. One sequence, 5′UCAUGA3′, was remeasured since it appeared to be unusual in previous studies (9, 12), and the new values are listed in Table 1. Thermodynamic parameters from averages of fits of melting curves are supplied in the Supporting Information. The first 90 sequences were used in regression analysis to obtain nearest-neighbor parameters.

The sequences vary in length from tetramer to decamer, spanning the most common helix lengths in known RNA secondary structures (1). Because they were synthesized with T4 RNA ligase, 15 of the sequences have a 3′ terminal phosphate. No correction was made for these phosphates since 3′ terminal phosphates have been observed to both stabilize and destabilize duplexes and the effect is less than 0.2 kcal/mol of phosphate, similar to random experimental error (9, 27, 56-58). In the database of 90 sequences, all the nearest neighbors are well represented with the following frequencies of occurrences: 5′AA3′/3′UU5′ = 41, 5′AU3′/3′UA5′ = 32, 5′UA3′/3′AU5′ = 28, 5′CU3′/3′GA5′ = 65, 5′CA3′/3′GU5′ = 66, 5′GU3′/3′CA5′ = 56, 5′GA3′/3′CU5′ = 67, 5′CG3′/3′GC5′ = 53, 5′GG3′/3′CC5′ = 45, 5′GC3′/3′CG5′ = 61. There are 51 duplexes that have only GC terminal base pairs, 33 that have only AU terminal base pairs, and 6 that have one GC and one AU terminal base pair. Also listed in Table 1 are 9 apparent two-state sequences not used in regression analysis so they could provide tests of the parameters, and 10 sequences that do not have two-state behavior.

Limits of Nearest-Neighbor Model. If the nearest-neighbor model is exact, then thermodynamic parameters measured for duplexes with different sequences but identical nearest neighbors should be identical within experimental error. Table 2 lists 13 pairs or groups of such duplexes in which the ends are also identical. Differences in thermodynamic parameters that are larger than experimental error are observed, indicating small non-nearest-neighbor effects. On average, the measured ΔG°37, ΔH°, ΔS°, and TM for a given sequence differ from the average values of all sequences containing the identical nearest neighbors by ±3.0%, ±4.4%, ±4.9%, and ±1.2 °C. This provides an empirical measure of the limits of nearest-neighbor models for predicting thermodynamic parameters of duplexes.
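The deviation measure described above can be illustrated with a minimal sketch. It takes the measured -ΔG°37 values for three of the Table 2 groups, computes each group's average, and reports how far each sequence lies from that average in percent. The function names and the choice of three groups are illustrative assumptions; only the numerical values come from Table 2, and the sketch does not attempt to reproduce the ±3.0% figure, which covers all 13 groups.

```python
from statistics import mean

# Measured -ΔG°37 values (kcal/mol) for three groups copied from Table 2.
# The selection of groups is arbitrary and only for illustration.
GROUPS = {
    "CAGCUG / CUGCAG": [6.68, 7.11],
    "CCGCGG / CGGCCG": [9.84, 9.90],
    "UCAUGA / UGAUCA": [4.31, 5.05],
}


def percent_deviations(values):
    """Absolute deviation of each value from the group average, in percent."""
    avg = mean(values)
    return [100.0 * abs(v - avg) / avg for v in values]


def mean_percent_deviation(groups):
    """Average the per-sequence percent deviations over all groups."""
    all_devs = []
    for values in groups.values():
        all_devs.extend(percent_deviations(values))
    return mean(all_devs)


if __name__ == "__main__":
    for name, values in GROUPS.items():
        devs = ", ".join(f"{d:.1f}%" for d in percent_deviations(values))
        print(f"{name}: {devs}")
    print(f"mean over these groups: {mean_percent_deviation(GROUPS):.1f}%")
```

The same calculation applies unchanged to ΔH°, ΔS°, or TM once the corresponding columns are substituted.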
Table 3 lists thermodynamic parameters measured for 4 sets of duplexes with identical nearest neighbors, but different ends. A sequence with only GC ends must have one more GC pair and one less AU pair than a sequence with the same nearest neighbors but only AU ends. This provides a test of potential base composition and end effects on nearest-neighbor models. Duplexes with only GC ends are consistently more stable than duplexes with only AU ends, with an average enhancement of 1.3 kcal/mol in ΔG°37 (Table 3). This difference could be either a base composition or an end effect, or both. The results indicate that a nearest-neighbor model for RNA should contain terms that distinguish between sequences with identical nearest neighbors but different base compositions and, therefore, different ends.

The INN-HB Model Based on Stacking and Hydrogen Bonding. There is evidence that the main interactions stabilizing double helix formation in RNA are stacking and hydrogen bonding (1, 59, 60). The free-energy change associated with a nearest-neighbor interaction is the free-energy change for propagating the helix by either of the base pairs comprising the nearest neighbor. This suggests that the free-energy change can largely be attributed to stacking between the two base pairs and formation of hydrogen bonds in the two base pairs. Because each internal base pair participates in two nearest neighbors, its contribution to duplex stability is partitioned to both nearest neighbors;


FIGURE 1: Comparisons between numbers of Watson-Crick hydrogen bonds in duplexes and counts of hydrogen bonds by the INN-HB model. For simplicity, initiation is assumed to start at the second base pair, but exactly the same sums are calculated for initiation at any other base pair, as long as 3 hydrogen bonds are assigned to initiation. The INN-HB model correctly counts the number of hydrogen bonds for all cases. Note that the INN model (8, 9) lacks the terminal AU term and uses a different initiation parameter for duplexes with and without a GC pair. Thus it correctly counts hydrogen bonds for sequences with only terminal GC pairs or sequences containing only AU pairs, but overcounts for GC-containing sequences with at least one terminal AU pair. Thus the INN model cannot predict differences between duplexes in Table 3.
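The bookkeeping summarized in the Figure 1 caption can be sketched in a few lines. The sketch assumes 3 hydrogen bonds per GC pair and 2 per AU pair, assigns half of each pair's hydrogen bonds to each nearest neighbor it belongs to, assigns 3 hydrogen bonds to initiation, and subtracts half a hydrogen bond per terminal AU pair. The function names are hypothetical, and the "no terminal AU term" count is only a simplified stand-in for the INN model, which also uses separate initiation parameters.

```python
# Watson-Crick hydrogen bonds per base pair, keyed by the top-strand base.
HBONDS = {"A": 2, "U": 2, "G": 3, "C": 3}


def actual_hbonds(strand):
    """Total Watson-Crick hydrogen bonds in a fully paired duplex."""
    return sum(HBONDS[base] for base in strand)


def model_hbonds(strand, terminal_au_term=True):
    """Hydrogen bonds implied by nearest-neighbor bookkeeping.

    Each nearest neighbor carries half of the hydrogen bonds of each of its
    two base pairs; initiation carries 3; each terminal AU pair removes 0.5
    when the terminal AU term is switched on (INN-HB style).
    """
    nn_total = sum((HBONDS[x] + HBONDS[y]) / 2
                   for x, y in zip(strand, strand[1:]))
    count = nn_total + 3  # initiation is assigned 3 hydrogen bonds
    if terminal_au_term:
        n_terminal_au = sum(1 for base in (strand[0], strand[-1])
                            if base in "AU")
        count -= 0.5 * n_terminal_au
    return count


if __name__ == "__main__":
    for strand in ("ACGCA", "GCACG"):  # top strands of a pair from Table 3
        print(strand,
              "actual:", actual_hbonds(strand),
              "with terminal AU term:", model_hbonds(strand),
              "without:", model_hbonds(strand, terminal_au_term=False))
```

For a duplex with two terminal AU pairs, dropping the terminal AU term overcounts by one hydrogen bond in this bookkeeping, which is the discrepancy the INN-HB model is designed to remove.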

therefore each nearest-neighbor interaction includes the stacking between the two base pairs and only half of the hydrogen-bonding contribution from each base pair. To be exact, these contributions are averaged over the influences of bordering base pairs of all possibilities. Thus, for example, ΔG° values for the nearest neighbors 5′GG3′/3′CC5′, 5′AA3′/3′UU5′, and 5′GA3′/3′CU5′ include the stabilizing effects of

Table 2: Comparisons of Thermodynamic Parameters for Sequences with Identical Nearest Neighbors and Identical Ends

Nearest-neighbor columns give frequencies of occurrences; each column is labeled by the 5′ to 3′ doublet over its 3′ to 5′ partner (for example, AA/UU is 5′AA3′/3′UU5′). Thermodynamic parameters (a): -ΔG°37 and -ΔH° in kcal/mol, -ΔS° in eu, TM (b) in °C.

sequence      AA/UU AU/UA UA/AU CU/GA CA/GU GU/CA GA/CU CG/GC GG/CC GC/CG   -ΔG°37    -ΔH°    -ΔS°     TM   ref
CAGCUG            0     0     0     2     2     0     0     0     0     1     6.68   51.55   144.7   43.1   9
CUGCAG            0     0     0     2     2     0     0     0     0     1     7.11   55.41   155.7   45.3   9
average                                                                       6.90   53.48   150.2   44.2
prediction c                                                                  7.28   53.11   147.8   46.6
CCGCGG            0     0     0     0     0     0     0     2     2     1     9.84   60.79   164.3   59.8   d
CGGCCG            0     0     0     0     0     0     0     2     2     1     9.90   54.12   142.6   63.2   98
average                                                                       9.87   57.45   153.4   61.4
prediction c                                                                 10.14   59.33   158.6   62.2
GACGUC            0     0     0     0     0     2     2     1     0     0     7.35   58.06   163.5   46.2   12
GUCGAC            0     0     0     0     0     2     2     1     0     0     7.09   53.63   150.1   45.3   12
average                                                                       7.22   55.85   156.8   45.8
prediction c                                                                  7.02   54.71   153.6   45.1
GAGCUG/           0     0     0     2     1     0     1     0     0     1     7.49   51.59   142.2   45.5   d
GCUGAG/           0     0     0     2     1     0     1     0     0     1     7.72   55.85   155.2   46.2   d
average                                                                       7.61   53.72   148.7   45.9
prediction c                                                                  7.95   55.11   152.0   47.9
GCCGCG/           0     0     0     0     0     0     0     2     1     2    10.88   59.69   157.4   63.9   d
GCGCCG/           0     0     0     0     0     0     0     2     1     2    10.91   57.85   151.3   65.2   d
GCGCGG/           0     0     0     0     0     0     0     2     1     2    11.38   71.16   192.7   61.9   d
GCGGCG/           0     0     0     0     0     0     0     2     1     2    10.40   58.50   155.0   61.8   101
GGCGCG/           0     0     0     0     0     0     0     2     1     2    10.78   63.53   170.1   61.6   d
average                                                                      10.87   62.15   165.3   62.8
prediction c                                                                 10.73   60.82   161.4   62.7
GCCGGC            0     0     0     0     0     0     0     1     2     2    11.20   62.72   166.0   67.2   98
GGCGCC            0     0     0     0     0     0     0     1     2     2    11.33   67.78   182.0   65.2   73
average                                                                      11.25   65.25   174.0   66.2
prediction c                                                                 11.20   63.57   168.8   66.6
UCAUGA            0     1     0     0     2     0     2     0     0     0     4.31   41.89   121.2   27.2   d
UGAUCA            0     1     0     0     2     0     2     0     0     0     5.05   44.73   128.0   32.6   12
average                                                                       4.68   43.31   124.6   29.9
prediction c                                                                  4.60   44.09   127.4   29.5
GACUCAG/          0     0     0     2     1     1     2     0     0     0     9.05   64.11   177.5   52.0   d
GAGUGAG/          0     0     0     2     1     1     2     0     0     0     9.71   70.49   196.0   53.7   d
average                                                                       9.38   67.30   186.8   52.8
prediction c                                                                  9.12   64.07   177.1   52.4
AACUAGUU          2     0     1     2     0     2     0     0     0     0     7.16   54.62   153.0   45.7   12
ACUUAAGU          2     0     1     2     0     2     0     0     0     0     6.16   47.23   132.4   40.3   12
AGUUAACU          2     0     1     2     0     2     0     0     0     0     6.36   52.42   148.5   41.1   12
average                                                                       6.67   52.84   149.1   42.5
prediction c                                                                  6.41   54.04   153.6   41.2
ACUAUAGU          0     1     2     2     0     2     0     0     0     0     6.98   59.21   168.4   44.0   12
AGUAUACU (e)      0     1     2     2     0     2     0     0     0     0     6.80 (e) 53.10 (e) 149.1 (e) 44.1 (e) 12
average                                                                       6.89   56.16   158.8   44.0
prediction c                                                                  6.98   57.47   162.8   44.2
AGAUAUCU          0     2     1     2     0     0     2     0     0     0     6.58   64.51   186.8   41.4   12
AUCUAGAU          0     2     1     2     0     0     2     0     0     0     7.20   59.89   169.9   45.1   12
average                                                                       6.89   62.20   178.4   43.1
prediction c                                                                  6.97   61.24   175.0   43.7
GAACGUUC          2     0     0     0     0     2     2     1     0     0     9.30   77.00   218.3   52.3   12
GUUCGAAC          2     0     0     0     0     2     2     1     0     0     8.76   74.19   211.0   50.4   12
average                                                                       9.03   75.60   214.7   51.3
prediction c                                                                  8.88   68.35   191.6   52.5
UCUAUAGA          0     1     2     2     0     0     2     0     0     0     6.96   62.06   177.7   43.5   12
UAGAUCUA          0     1     2     2     0     0     2     0     0     0     7.25   60.15   170.6   45.3   12
average                                                                       7.11   61.11   174.2   44.3
prediction c                                                                  7.20   59.55   168.8   45.1

a From 1/TM vs ln(CT/a) data.
b Calculated for 0.1 mM strands for self-complementary duplexes and 0.2 mM strands for nonself-complementary duplexes (see Materials and Methods). Average TM values are calculated using averages of ΔH° and ΔS°.
c Predicted by the INN-HB model.
d This work.
e Non-two-state sequence.

stacking and 3, 2, and 2½ hydrogen bonds, respectively. The other half of the hydrogen-bonding contribution is reflected in the adjacent nearest-neighbor interactions in the sequence (see Figure 1). Terminal base pairs only participate in one nearest neighbor; therefore half of the hydrogen bonds in the two terminal base pairs are not included in any nearest-neighbor parameters. Half of the hydrogen bonds in two terminal GC pairs can be included in the initiation term. Effectively, the initiation term includes the energy of forming 3 hydrogen bonds, in addition to other terms such as the entropy loss of aligning two single strands (see Materials and Methods). As illustrated by the sequences in Table 3, it is possible to construct duplexes with identical nearest neighbors but different base compositions and therefore different total numbers of hydrogen bonds. Because the same initiation parameter is used for every sequence, this suggests that the nearest-neighbor model should include a term that accounts

Table 3: Comparisons of Thermodynamic Parameters for Sequences with Identical Nearest Neighbors but Different Ends

Nearest-neighbor columns give frequencies of occurrences; each column is labeled by the 5′ to 3′ doublet over its 3′ to 5′ partner. -ΔG°37 in kcal/mol, TM (d) in °C.

                                                                                            experimental a    predicted b
sequence                 AA/UU AU/UA UA/AU CU/GA CA/GU GU/CA GA/CU CG/GC GG/CC GC/CG  AU end c  GC end c  -ΔG°37    TM    -ΔG°37    TM
5′ACGCA3′/3′UGCGU5′          0     0     0     0     1     1     0     1     1     0      2         0        4.97   29.4    5.14   29.0
5′GCACG3′/3′CGUGC5′          0     0     0     0     1     1     0     1     1     0      0         2        6.17   37.5    6.04   36.7
5′AGCGA3′/3′UCGCU5′          0     0     0     1     0     1     1     1     0     0      2         0        5.05   30.2    5.22   29.9
5′GCUCG3′/3′CGAGC5′          0     0     0     1     0     1     1     1     0     0      0         2        6.14   37.2    6.12   37.4
5′ACUGUCA3′/3′UGACAGU5′      0     0     0     1     2     2     1     0     0     0      2         0        7.92   48.2    8.14   48.7
5′GUCACUG3′/3′CAGUGAC5′      0     0     0     1     2     2     1     0     0     0      0         2        8.62   51.1    9.04   52.2
5′AGUCUGA3′/3′UCAGACU5′      0     0     0     2     1     1     2     0     0     0      2         0        7.52   45.7    8.22   49.0
5′GACUCAG3′/3′CUGAGUC5′      0     0     0     2     1     1     2     0     0     0      0         2        9.05   52.0    9.12   52.4
5′GAGUGAG3′/3′CUCACUC5′      0     0     0     2     1     1     2     0     0     0      0         2        9.71   53.7    9.12   52.4

a From 1/TM vs ln(CT/a) data.
b Predicted by the INN-HB model.
c Include both orientations of terminal base pairs (see Results and Discussion).
d Calculated for 0.2 mM for nonself-complementary duplexes (see Materials and Methods).

for the dependence of the total number of hydrogen bonds on base composition. A simple way to do this is to include a term for terminal AU pairs that is equivalent to the free-energy increment for removal of half of a hydrogen bond. This INN-HB model contains separate parameters for initiation, 10 nearest neighbors, and helix termination by an AU pair, giving a total of 12 parameters. The INN-HB model assumes that stacking and hydrogen-bonding interactions at the end of a helix are identical to those in the middle of a helix. All of the first 90 sequences used in the analysis contain GC pairs; thus the initiation can be treated as involving formation of 3 hydrogen bonds. The same initiation parameter should suffice for duplexes with only AU pairs since the terminal AU term corrects the overcounting of hydrogen bonds. The role of the terminal AU parameter in accounting for hydrogen bonds is illustrated for several cases in Figure 1, where the numbers of actual hydrogen bonds are compared to the numbers contained in the INN-HB model. Note that the INN model (8, 9) does not contain a terminal AU term, so that it essentially overestimates the number of hydrogen bonds in GC-containing duplexes with at least one terminal AU base pair.

Singular-value decomposition of the matrix formed by thermodynamic parameters of the first 90 duplexes in Table 1 confirms that the rank is 12, therefore allowing determination of 12 linearly independent parameters (see Supporting Information). For the INN-HB model, these parameters are listed in Table 4. Predictions based on this model are listed alongside experimental results in Table 1. The root-mean-square (rms) deviations between predicted and measured values of ΔG°37, ΔH°, and ΔS° are 0.35 kcal/mol, 4.65 kcal/mol, and 14.6 eu, respectively, corresponding to relative mean deviations of 3.2%, 6.0%, and 6.8%, respectively. The mean deviation for TM is 1.3 °C. The residuals are compatible with the observed limits of the nearest-neighbor model (Table 2). Thus the INN-HB model can predict within the experimental limits of the nearest-neighbor model (12, 17, 18, 54). For an 11 parameter fit of the same data to the

Table 4: RNA Thermodynamic Parameters for INN-HB Nearest-Neighbor Model, 1 M NaCl, pH 7 (a)

parameter                                    ΔG°37 (kcal/mol)   ΔH° (kcal/mol)    ΔS° b (eu)
5′AA3′/3′UU5′                                -0.93 (0.03)        -6.82 (0.79)     -19.0 (2.5)
5′AU3′/3′UA5′                                -1.10 (0.08)        -9.38 (1.68)     -26.7 (5.2)
5′UA3′/3′AU5′                                -1.33 (0.09)        -7.69 (2.02)     -20.5 (6.3)
5′CU3′/3′GA5′                                -2.08 (0.06)       -10.48 (1.24)     -27.1 (3.8)
5′CA3′/3′GU5′                                -2.11 (0.07)       -10.44 (1.28)     -26.9 (3.9)
5′GU3′/3′CA5′                                -2.24 (0.06)       -11.40 (1.23)     -29.5 (3.9)
5′GA3′/3′CU5′                                -2.35 (0.06)       -12.44 (1.20)     -32.5 (3.7)
5′CG3′/3′GC5′                                -2.36 (0.09)       -10.64 (1.65)     -26.7 (5.0)
5′GG3′/3′CC5′                                -3.26 (0.07)       -13.39 (1.24)     -32.7 (3.8)
5′GC3′/3′CG5′                                -3.42 (0.08)       -14.88 (1.58)     -36.9 (4.9)
initiation c                                  4.09 (0.22)         3.61 (4.12)      -1.5 (12.7)
per terminal AU d                             0.45 (0.04)         3.72 (0.83)      10.5 (2.6)
symmetry correction (self-complementary)      0.43                0                -1.4
symmetry correction (non-self-complementary)  0                   0                 0

a Numbers in parentheses are uncertainties for parameters.
b Calculated from nearest-neighbor parameters for ΔG°37 and ΔH° (see Materials and Methods).
c Includes potential GC end effects.
d Parameter is per terminal AU pair.
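As a hedged illustration of how the Table 4 parameters combine, the sketch below sums initiation, nearest-neighbor, terminal AU, and symmetry terms for a fully Watson-Crick paired duplex given one strand. The TM step assumes the two-state relation 1/TM = (R/ΔH°) ln(CT/a) + ΔS°/ΔH°, with a = 1 for self-complementary and a = 4 for nonself-complementary duplexes, R = 1.987 cal K⁻¹ mol⁻¹, and total strand concentrations of 0.1 mM and 0.2 mM, respectively, as in the table footnotes. The function and variable names are hypothetical, not taken from the paper.

```python
import math

# (ΔG°37 kcal/mol, ΔH° kcal/mol, ΔS° eu) from Table 4, keyed by the
# 5'->3' doublet on one strand; the other orientation is found by
# taking the reverse complement of the doublet.
NN = {
    "AA": (-0.93, -6.82, -19.0), "AU": (-1.10, -9.38, -26.7),
    "UA": (-1.33, -7.69, -20.5), "CU": (-2.08, -10.48, -27.1),
    "CA": (-2.11, -10.44, -26.9), "GU": (-2.24, -11.40, -29.5),
    "GA": (-2.35, -12.44, -32.5), "CG": (-2.36, -10.64, -26.7),
    "GG": (-3.26, -13.39, -32.7), "GC": (-3.42, -14.88, -36.9),
}
INIT = (4.09, 3.61, -1.5)
TERMINAL_AU = (0.45, 3.72, 10.5)   # applied once per terminal AU pair
SYMMETRY = (0.43, 0.0, -1.4)       # self-complementary duplexes only
R = 1.987                          # cal K^-1 mol^-1 (assumed value)
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(seq):
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def predict(strand):
    """Return (ΔG°37, ΔH°, ΔS°, TM in °C) for a Watson-Crick duplex."""
    self_comp = strand == reverse_complement(strand)
    totals = list(INIT)
    for i in range(len(strand) - 1):
        doublet = strand[i:i + 2]
        params = NN[doublet] if doublet in NN else NN[reverse_complement(doublet)]
        totals = [t + p for t, p in zip(totals, params)]
    n_terminal_au = sum(1 for base in (strand[0], strand[-1]) if base in "AU")
    totals = [t + n_terminal_au * p for t, p in zip(totals, TERMINAL_AU)]
    if self_comp:
        totals = [t + p for t, p in zip(totals, SYMMETRY)]
    dg, dh, ds = totals
    # Assumed concentrations: 0.1 mM (self-complementary), 0.2 mM otherwise.
    c_total, a = (1e-4, 1) if self_comp else (2e-4, 4)
    tm_celsius = 1000.0 * dh / (ds + R * math.log(c_total / a)) - 273.15
    return dg, dh, ds, tm_celsius


if __name__ == "__main__":
    for strand in ("CCGCGG", "ACUGUCA", "GUCACUG"):
        dg, dh, ds, tm = predict(strand)
        print(f"{strand}: dG37={dg:.2f} dH={dh:.2f} dS={ds:.1f} TM={tm:.1f}")
```

Under these assumptions the sketch should agree with the prediction rows for CCGCGG in Table 2 and for the ACUGUCA and GUCACUG duplexes in Table 3; the last two share all nearest neighbors and differ only by the two terminal AU terms.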

INN model (9), rms deviations are 0.52 kcal/mol, 5.81 kcal/mol, and 17.7 eu, corresponding to relative mean deviations of 5.1%, 7.8%, and 8.6% for ΔG°37, ΔH°, and ΔS°, respectively, and 1.9 °C for TM. The INN model does not account for the dependence of the number of hydrogen bonds on base composition. The rms values for both of the models are about 30% higher than the mean deviations, close to the expected 25% for the large sample size (36). Parameters derived for the INN model are listed in Supporting Information.
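The goodness-of-fit probability Q of eq 17, which is used to compare these models in Table 5, can be evaluated without special libraries when the number of degrees of freedom f is an integer. The sketch below uses the closed-form series for the regularized upper incomplete gamma function at integer and half-integer arguments. It assumes f = 90 - 12 = 78 for the INN-HB fit and f = 90 - 11 = 79 for the INN fit, and it rebuilds χ² from the rounded reduced chi-square values in Table 5, so only approximate agreement with the tabulated Q values should be expected. The function name is hypothetical.

```python
import math


def chi2_upper_tail(chi2, f):
    """Q(chi2; f): probability of exceeding chi2 with f degrees of freedom.

    Uses the finite series for the regularized upper incomplete gamma
    function Gamma(f/2, chi2/2) / Gamma(f/2), valid for integer f >= 1.
    """
    if chi2 <= 0:
        return 1.0
    h = chi2 / 2.0
    odd = f % 2 == 1
    # Odd f starts from the erfc term; even f starts from zero.
    q = math.erfc(math.sqrt(h)) if odd else 0.0
    a = 0.5 if odd else 0.0
    while a <= f / 2 - 1:
        # Term h^a * exp(-h) / Gamma(a + 1), evaluated in log space.
        q += math.exp(-h + a * math.log(h) - math.lgamma(a + 1.0))
        a += 1.0
    return q


if __name__ == "__main__":
    # Assumed degrees of freedom and rounded reduced chi-squares (Table 5).
    for label, reduced, f in (("INN-HB", 1.1, 78), ("INN", 2.9, 79)):
        q = chi2_upper_tail(reduced * f, f)
        print(f"{label}: f={f}, chi2={reduced * f:.1f}, Q={q:.2e}")
```

The many orders of magnitude separating the two Q values, rather than their exact digits, is the feature that discriminates between the models.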



Table 5: Statistical Comparisons of Nearest-Neighbor Models

statistical measures of fits

model     parameter   χ_f² a    Q b            rms c
INN-HB    ΔG°37       1.1       0.32           0.35 kcal/mol
          ΔH°         0.47      >0.99          4.65 kcal/mol
          ΔS°         0.47      >0.99          14.6 eu
INN-24    ΔG°37       0.65      0.987          0.31 kcal/mol
          ΔH°         0.46      >0.99          4.59 kcal/mol
          ΔS°         0.42      >0.99          13.2 eu
INN       ΔG°37       2.9       5 × 10⁻¹⁶      0.52 kcal/mol
          ΔH°         0.71      0.97           5.81 kcal/mol
          ΔS°         0.68      0.99           17.7 eu

a Reduced chi-square, χ_f² = χ²/f (see Materials and Methods).
b Cumulative chi-square distribution probability (see Materials and Methods).
c Root-mean-square of residuals.

An INN-24 Model Containing Corrections for Terminal Nearest Neighbors. Further expansions of the nearest-neighbor model are also possible. For example, another possible model distinguishes between internal and terminal nearest neighbors by assuming that they can be different. This model has 27 possible parameters: 1 for initiation, 10 for internal nearest neighbors, and 16 for terminal nearest neighbors. Three constraint equations described in the Discussion (eq 20-22), however, limit the number of independent parameters accessible from an oligonucleotide database to 27 - 3 = 24 parameters. Thermodynamic parameters derived by fitting the first 90 duplexes in Table 1 to this INN-24 model are listed in Supporting Information. The errors associated with the INN-24 parameters are larger than for the INN-HB model because the number of parameters is increased. With the INN-24 parameters, the rms deviations between predicted and measured ΔG°37, ΔH°, and ΔS° values are 0.31 kcal/mol, 4.59 kcal/mol, and 13.2 eu, corresponding to relative mean deviations of 2.5%, 5.3%, and 6.1%, respectively. The mean deviation for TM is 1.2 °C. Once more data have been collected, a triplet model, which takes next nearest neighbors into account, can also be fitted (61).

Statistical Comparisons of Models. As described in Materials and Methods, statistical measures have been developed to test the significance of parameters used in fitting models to data (35, 36, 52). For example, Q is the probability of getting a value of χ² larger than an observed value by random chance alone. Table 5 contains comparisons of such measures for the INN (8, 9), INN-HB, and INN-24 models. Many other INN models with 12 parameters as well as the ISS model of Gray (10) have the same statistics as those for the INN-HB model (see Discussion). Note that these comparisons depend on the errors assumed for the experimental measurements. In general, larger errors decrease the statistical significance of the differences between the models (35), but the trends are the same. Thus ΔG°37 discriminates best between models because it has a relatively small error compared to ΔH° and ΔS° (see Materials and Methods).

The most dramatic difference in statistical measures of models is the increase in Q for the fits to ΔG°37 from 5 × 10⁻¹⁶ to 0.32 upon increasing the parameters in the fit of GC-containing duplexes from 11 in the INN model to 12 in the INN-HB model (or any other such 12 parameter model). Overall, the INN-HB model has a reduced chi-square value, χ_f², of 1.1, close to the expectation value of 1. In contrast, χ_f² for the INN model is much higher, 2.9. This improvement corresponds to an F probability of 4 × 10⁻¹⁹ (see Materials and Methods). Consistent with this is that the t test shows that each of the 12 parameters in the INN-HB model is significant, with t probabilities ranging from 10⁻¹⁹ to 10⁻⁶⁰. In particular, the t probability for the terminal AU parameter is 4.2 × 10⁻¹⁹; therefore this parameter is very meaningful (see Materials and Methods). These results indicate good quality of fits and applicability of the INN-HB model to RNA thermodynamics.

There is also a 0.17 kcal/mol improvement in the root-mean-square value of residuals for the INN-HB compared to the INN model (Table 5). This improvement is not as insignificant as it appears. Because the uncertainty in experimental ΔG°37 is on average 0.3 kcal/mol, a 0.17 kcal/mol difference makes a significant improvement. Also, the value of the terminal AU parameter, 0.45 kcal/mol, is over the limit of experimental uncertainty, so the data can detect the existence of this parameter. In general, omission of the terminal AU parameter in the INN model (8, 9) tends to result in overestimation of stability for duplexes with terminal AU base pairs and underestimation for duplexes with terminal GC base pairs. The residuals between measurements and predictions by the INN-HB model are more randomly distributed. Thus statistical analysis of the database of 90 duplexes is consistent with the conclusion drawn from the results in Table 3: Duplexes with terminal AU and GC base pairs have different stabilities even if they have the same nearest neighbors.

The INN-24 model fits the data marginally better than 12 parameter models (Table 5). The t test, however, shows that the difference between the average of 8 AU ends and the average of 8 GC ends is more significant than the differences within them. Therefore the simpler 12 parameter models are adequate to rationalize the sequence dependence of the RNA thermodynamics.

Predictions for Duplexes not Used in Fitting. Another measure of nearest-neighbor parameters is how well they predict the thermodynamics of sequences not used in deriving the parameters. Thermodynamics of 9 such two-state duplexes measured after analysis of the first 90 duplexes in Table 1 provide a test of predictions from Table 4. Table 1 lists the measured and predicted ΔG°37, ΔH°, ΔS°, and TM. The rms residuals for these three duplexes are 0.47 kcal/mol, 5.19 kcal/mol, and 15.5 eu for ΔG°37, ΔH°, and ΔS°, corresponding to relative mean deviations of 3.5%, 5.9%, and 6.6%, respectively. The average difference between measured and predicted TM is 1.6 °C. This probably approximates the level of accuracy that can be expected in predictions.

Also listed in Table 1 are thermodynamic parameters determined from 1/TM vs ln(CT/a) plots for sequences that do not melt in a two-state manner. Non-two-state melting usually occurs with long or AU-rich oligonucleotide duplexes. Accurate prediction for non-two-state molecules requires statistical treatment that includes intermediate states in both duplex and single strands (31, 43, 62-69). Thus the thermodynamic parameters reported for these non-two-state duplexes are less reliable because they are derived under the two-state assumption. Nevertheless, comparison with

values predicted from the nearest-neighbor parameters in Table 4 provides a qualitative test of the model. The rms of differences between predicted and measured values of ΔG°37, ΔH°, and ΔS° are 0.54 kcal/mol, 9.42 kcal/mol, and 29.4 eu corresponding to relative mean deviations of 8.1%, 9.7%, and 10.7%, respectively. The average difference between measured and predicted TM is 2.4 °C. Evidently, the nearest-neighbor model can even provide useful approximations for many RNA sequences that are not two-state.

DISCUSSION

Watson-Crick base pairs form the helical framework for RNA structure and are also central to RNA-RNA recognition processes (70-72). Thus their thermodynamic properties are important for predicting RNA structure and RNA-RNA associations. Previous studies of RNA duplex formation by Watson-Crick paired oligoribonucleotides showed that a nearest-neighbor model fits the sequence dependence of thermodynamic properties well (8, 9). Here we enlarge the database of duplexes analyzed from 45 to 90 sequences. The results are fit significantly better by expanding the nearest-neighbor model. A physical model, the INN-HB model, is proposed in which an additional term, the terminal AU term, makes the predicted thermodynamic parameters dependent on base composition and therefore the number of Watson-Crick hydrogen bonds in the duplex.

Evidence for Expanding the Nearest-Neighbor Model. The results in Table 3 provide the most straightforward evidence for expanding the nearest-neighbor model. Duplexes with the same nearest neighbors but different base compositions and therefore different ends consistently have different stabilities. The duplex with one more GC pair and one less AU pair is always more stable, with a range in enhancement of stability of 0.7-2.2 kcal/mol. Switching a GC pair to an AU pair in base composition decreases the number of hydrogen bonds in the duplex by 1. The INN model (8, 9) predicts that sequences with the same nearest neighbors but different ends have the same stabilities because there is no term in the INN model to account for the difference in base composition. To account for differences in base composition and therefore the number of duplex hydrogen bonds, a term for terminal AU pairs can be added to the INN model to make the INN-HB model (see Figure 1). As shown in Table 5 for the data set of 90 duplexes, this term improves the rms of residuals in predicted ΔG°37 by only 0.17 kcal/mol. Other statistical measures of fit, such as the Q probability, are improved much more, however (see Table 5). That is, the average residual is improved by only a small magnitude, but the deviations are more randomly distributed, thus providing improved statistics when compared to uncertainties of measurements. Therefore a statistical analysis of the database also provides evidence for at least one additional term.

Evidence for a Model Based on Hydrogen Bonds. As discussed below, there are many physical models that can rationalize an additional term in the nearest-neighbor model. As illustrated in Figure 1, the additional term for terminal AU pairs in the INN-HB model is associated with subtraction of half of a hydrogen bond per terminal AU pair from the

hydrogen bonds contained in the nearest neighbors and initiation. The results in Table 3 show that duplexes with two terminal GC pairs are on average 1.3 kcal/mol more stable than duplexes with the same nearest neighbors and two terminal AU pairs. Thus these results suggest that the third hydrogen bond in a GC pair contributes on average 1.3 kcal/mol to stability. The statistical treatment of the 90 duplex database gives a penalty of 0.45 kcal/mol for each terminal AU pair (Table 4), suggesting that each hydrogen bond contributes on average 0.9 kcal/mol to stability. Empirical estimates of the contributions of terminal hydrogen bonds to RNA duplex stability range from 0.7 to 1.7 kcal/mol of hydrogen bonds, depending on the sequence (73, 74). Comparisons of other nucleic acid duplexes (75-78) with different numbers of internal hydrogen bonds suggest that each hydrogen bond contributes 0.5-1.8 kcal/mol (60). Thus the magnitude of the additional term in the INN-HB model is consistent with a physical model in which the term arises from the dependence of the numbers of Watson-Crick hydrogen bonds on base composition.

Studies of thermodynamic stabilities of hairpins and internal loops are also consistent with hydrogen bonding as the origin of the additional term in the INN-HB model. Serra et al. (79) found that hairpins closed by AU pairs are on average 0.6 kcal/mol less favorable than hairpins closed by GC pairs, when the free energies of nearest neighbors in the stems are calculated with the INN parameters of Freier et al. (9). When recalculated with the INN-HB model (Table 4) applied to the hairpin stems, this difference vanishes (D. H. Mathews and D. H. Turner, unpublished results). Thus this difference can be completely attributed to the INN model overcounting by half of a hydrogen bond the number of hydrogen bonds in the stem of a hairpin loop closed with an AU pair. Wu et al. (80) investigated stabilities of symmetric tandem mismatches and found that destabilizing mismatches are less stable when the adjacent pair is AU rather than GC. When recalculated with the INN-HB model, in which AU pairs adjacent to mismatches are considered terminal AU pairs, destabilizing tandem mismatches with two adjacent AU pairs are on average only 0.3 kcal/mol less stable than those with two adjacent GC pairs. Thus the INN-HB model is largely able to account for the difference in stabilities of destabilizing tandem mismatches closed by AU or GC pairs. Presumably these destabilizing mismatches are relatively unstructured. Symmetric tandem mismatches comprised of GU, GA, or UU pairs can be stabilizing, and NMR evidence indicates that there are significant hydrogen bonding interactions in the mismatches, as well as stacking between mismatches and stacking between mismatches and closing base pairs (30, 81-84). For tandem GU, GA, and UU mismatches, stabilities are therefore more dependent on the identity and orientation of the closing base pairs and do not fit a simple model.

The above comparisons suggest that the hydrogen-bonding model is more reasonable than a model in which the additional term is attributed to a difference between terminal and internal base pairs or to a difference between the interactions of terminal GC and AU pairs with their environment. Presumably the environments of terminal base pairs adjacent to water, hairpin loops, and internal loops are quite different. Nevertheless, it appears that the term accounting

RNA Nearest-Neighbor Parameters for terminal base pair and/or composition effects is similar for all three environments. Comparisons with DNA and RNA/DNA Hybrid Results. While the RNA results are consistent with a hydrogenbonding model, it is surprising that a term accounting for numbers of hydrogen bonds in DNA duplexes is small at 37 C (0.05 kcal/mol) (15, 16). This suggests compensating effects at termini of DNA duplexes. For example, terminal AT pairs in DNA may be more flexible than terminal AU pairs in RNA, thus allowing them to more effectively maximize other interactions. Another possibility is that AT pairs have an extra methyl group that is thought to stabilize duplexes by about 0.3 kcal/mol of AT pair (ref 85; Allawi & SantaLucia, unpublished results). The methyl group contribution may also explain the smaller differences observed in G values upon substituting an AT pair with a GC pair in DNA nearest neighbors compared with substituting AU for GC in RNA nearest neighbors (compare Table 4 with Table 1 of ref 15). For RNA/DNA hybrids (18), statistical treatment (SantaLucia, unpublished results) shows that differential treatment of terminal rAdT, dArU, rGdC, and dGrC is also not necessary. A Non-Nearest-Neighbor Effect. Table 2 shows that pairs or groups of sequences with identical nearest neighbors and identical ends do not have identical thermodynamic properties. For example, on average the G37 for a given sequence differs by (3% from the average value for all sequences containing the identical nearest neighbors. This suggests non-nearest-neighbor effects. Comparison of the results with the INN-HB parameters in Table 4 reveals a trend. The more stable duplex of a pair has the more favorable nearestneighbor doublets closer to the middle of the duplex than the less stable duplex. Although the effect is small, it appears that the position of a nearest neighbor relative to the duplex end also affects stability. AlternatiVe Nearest-Neighbor Models. 
A large number of alternative models can also distinguish between sequences with identical nearest neighbors but different ends and base compositions. The number of parameters that can be independently determined for a given model, however, is limited by sequence constraints on duplexes. For example, one possible INN model, the INN-15 model, contains separate parameters for initiation, 10 nearest neighbors, and 4 ends, giving a total of 15 parameters. Here the ends are A3′/U5′, U3′/A5′, G3′/C5′, and C3′/G5′ (each written as the 3′-terminal base over its 5′-terminal partner), representing different identities and orientations of terminal base pairs. Following the reasoning of Gray and Tinoco (14), Goldstein and Benight (13), and Gray (10), there are two constraints on possible combinations of these parameters in duplexes:
\[
\begin{aligned}
&N\left(\frac{\mathrm{AA}}{\mathrm{UU}}\right) + 2N\left(\frac{\mathrm{AU}}{\mathrm{UA}}\right) + N\left(\frac{\mathrm{AC}}{\mathrm{UG}}\right) + N\left(\frac{\mathrm{AG}}{\mathrm{UC}}\right) + N\left(\frac{\mathrm{A}3'}{\mathrm{U}5'}\right) - N\left(\frac{\mathrm{AA}}{\mathrm{UU}}\right) \\
&\quad - 2N\left(\frac{\mathrm{UA}}{\mathrm{AU}}\right) - N\left(\frac{\mathrm{CA}}{\mathrm{GU}}\right) - N\left(\frac{\mathrm{GA}}{\mathrm{CU}}\right) - N\left(\frac{\mathrm{U}3'}{\mathrm{A}5'}\right) = 0
\end{aligned}
\tag{20}
\]

\[
\begin{aligned}
&N\left(\frac{\mathrm{GA}}{\mathrm{CU}}\right) + N\left(\frac{\mathrm{GU}}{\mathrm{CA}}\right) + 2N\left(\frac{\mathrm{GC}}{\mathrm{CG}}\right) + N\left(\frac{\mathrm{GG}}{\mathrm{CC}}\right) + N\left(\frac{\mathrm{G}3'}{\mathrm{C}5'}\right) - N\left(\frac{\mathrm{AG}}{\mathrm{UC}}\right) \\
&\quad - N\left(\frac{\mathrm{UG}}{\mathrm{AC}}\right) - 2N\left(\frac{\mathrm{CG}}{\mathrm{GC}}\right) - N\left(\frac{\mathrm{GG}}{\mathrm{CC}}\right) - N\left(\frac{\mathrm{C}3'}{\mathrm{G}5'}\right) = 0
\end{aligned}
\tag{21}
\]

We also recognized a third constraint:

\[
2N_{\mathrm{init}} - N\left(\frac{\mathrm{A}3'}{\mathrm{U}5'}\right) - N\left(\frac{\mathrm{U}3'}{\mathrm{A}5'}\right) - N\left(\frac{\mathrm{G}3'}{\mathrm{C}5'}\right) - N\left(\frac{\mathrm{C}3'}{\mathrm{G}5'}\right) = 0
\tag{22}
\]

Due to the high cooperativity of melting behavior, only the paired Watson-Crick neighbors appear in the constraint equations (86). In other words, if the cooperative unit is larger than the duplex (which is typically true for short oligonucleotide duplexes), there is no local melting and the population of intermediate states at any temperature is negligible. That is, an RNA strand is either completely paired or completely in random coil. Therefore any internal base pair is always bordered by other base pairs in a duplex (31). If there is local melting, more parameters including open base pairs will also be in the constraint equations. Furthermore, if contributions of nearest neighbors are assumed to be independent of their locations, they can be grouped, and then the first two constraints hold. The third constraint simply states the fact that, for each sequence, there is always an initiation term and two ends if we assume a sequence-independent initiation. Due to the three constraints, only 12 of the 15 parameters can be uniquely determined unless there is independent knowledge of the values of some parameters or relationships between them (10, 11, 13, 14).

The constraints above can be relieved by making reasonable assumptions of the values of some nearest neighbors or relationships between some of them. Note that, to relieve a constraint, one has to assume a relationship between a parameter of positive sign and a parameter of negative sign in the same constraint equation. Each assumption will reduce the number of parameters by 1. The specific interpretations of the parameters, however, depend on the assumptions that are made. For purposes of predicting thermodynamic properties of RNA Watson-Crick duplexes, it is not necessary to have unique knowledge of the actual initiation and end effects. In principle, separation of end effects from initiation requires a nonlinear length-dependent function for initiation to relieve the third constraint. We attempted this with a logarithmic dependence for initiation on length (46-48), but the error in the initiation term is large due to the relatively narrow length range (results not shown). Thus much longer sequences have to be measured, which could impose experimental difficulties due to the high TM and likely non-two-state behavior.

Since the choices of assumptions are arbitrary, any 12-parameter INN model of this type will fit the data in Table 1 as well as the INN-HB model and give identical predictions for any sequence. The sum of values for two nearest neighbors of the same composition but of reversed order of the two base pairs, for example, 5′CA3′/3′GU5′ and 5′AC3′/3′UG5′, etc., is always a constant for a given database. Except for the two like-neighbors, 5′AA3′/3′UU5′ and 5′GG3′/3′CC5′, the partitioning of values of individual parameters, however, depends on the choice of assumptions for relationships between parameters. For example, if it is assumed that terminal A3′/U5′ = U3′/A5′, G3′/C5′ = C3′/G5′, and the two equalized GC ends are combined with initiation, all of the 12 parameters have values identical to those of the INN-HB model; if it is assumed that the parameters for terminal A3′/U5′, U3′/A5′, and G3′/C5′ are all equal and different from terminal C3′/G5′, then fitting the data in Table 1 gives a ΔG°37 of -0.90, 4.99, -3.25, and -2.52 kcal/mol for the C3′/G5′ end, initiation, 5′CG3′/3′GC5′, and 5′GC3′/3′CG5′ parameters, respectively. Thus with these assumptions, a 5′CG3′/3′GC5′ nearest neighbor is more favorable than a 5′GC3′/3′CG5′ nearest neighbor. This is opposite of the trend observed with the INN-HB model (Table 4). Nevertheless, both models fit the data in Table 1 equally well and give identical predictions for any sequence.

Similarly, an equal fit to the data is obtained with the ISS model of Gray (10). The ISS model was developed to deal with the possibility that terminal AU base pairs differ from terminal GC base pairs because they interact differently with their environment, for example, with the solvent (10, 13). Parameters for the ISS model of Gray (10) can be constructed from parameters of the INN-HB model. For example, the EAGE/EUCE parameter in the ISS model can be derived by EAGE/EUCE = initiation + 5′AG3′/3′UC5′ + 5′A/3′U, where E is end. Each ISS parameter has the initiation term, one INN-HB nearest-neighbor term, and appropriate terminal AU term(s). That is, the ISS parameters represent the change of thermodynamic properties on forming a base pair doublet from dinucleotide single strands, except that the symmetry term has not been added to self-complementary dimers (10). As mentioned above, the ISS model gives exactly the same statistics and makes exactly the same predictions as the INN-HB model. An advantage of the ISS model is that it assumes nothing beyond nearest-neighbor behavior. A disadvantage is that no unique values can be attributed to the parameters other than 5′AA3′/3′UU5′ and 5′GG3′/3′CC5′ (10). In contrast, all of the individual nearest-neighbor parameters for the INN-HB model have unique values with physical meaning, if the interpretation is correct that the newly added term arises from hydrogen bonding and that terminal and internal base pairs are thermodynamically equivalent.
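The counting argument behind eqs 20-22 can be illustrated numerically. The sketch below is not part of the original analysis: the helper names are hypothetical, and it assumes fully paired Watson-Crick duplexes with one initiation term each. It counts dinucleotide steps and terminal bases over both strands and evaluates the left-hand sides of the three constraints, which come out as zero for any sequence. That linear dependence among the counts is why only 12 of the 15 INN-15 parameters can be determined by regression.

```python
from collections import Counter

COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def both_strands(top):
    """Return the top strand and its Watson-Crick partner, both written 5'->3'."""
    bottom = "".join(COMPLEMENT[b] for b in reversed(top))
    return top, bottom


def count_terms(top):
    """Count dinucleotide steps and 5'/3' terminal bases over both strands."""
    steps, three_prime, five_prime = Counter(), Counter(), Counter()
    for strand in both_strands(top):
        for i in range(len(strand) - 1):
            steps[strand[i:i + 2]] += 1
        five_prime[strand[0]] += 1
        three_prime[strand[-1]] += 1
    return steps, three_prime, five_prime


def constraint_residuals(top):
    """Left-hand sides of eqs 20, 21, and 22 for one duplex (hypothetical helper)."""
    steps, three_prime, five_prime = count_terms(top)

    def residual(base):
        # Steps with `base` on the 5' side plus 3'-terminal occurrences,
        # minus steps with `base` on the 3' side plus 5'-terminal occurrences.
        leading = sum(n for step, n in steps.items() if step[0] == base)
        trailing = sum(n for step, n in steps.items() if step[1] == base)
        return leading + three_prime[base] - trailing - five_prime[base]

    n_init = 1  # assumption: one sequence-independent initiation per duplex
    eq22 = 2 * n_init - sum(three_prime.values())
    return residual("A"), residual("G"), eq22
```

Counting steps on both strands reproduces the factor of 2 in front of the self-complementary nearest neighbors in eqs 20 and 21, because such a neighbor contributes the same step on each strand.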
As described above, comparisons with hairpin and internal loop thermodynamics are consistent with the INN-HB physical model, suggesting that the thermodynamics of terminal base pairs are not substantially perturbed by end effects. If the assumptions of the INN-HB model are incorrect, however, then the values for parameters other than 5′AA3′/3′UU5′ and 5′GG3′/3′CC5′ do not reflect the relative stabilities of various base pair doublets.

Comparisons of free-energy increments measured for terminal dangling nucleotides in oligonucleotides with three-dimensional X-ray structures of large RNAs also suggest that thermodynamics are not substantially perturbed by end effects (1, 87, 88). In general, dangling ends that favor oligonucleotide duplex formation by more than 1 kcal/mol are found stacked in structures of large RNAs. Dangling ends that favor duplex formation by less than 0.4 kcal/mol are often unstacked in large RNAs unless involved in hydrogen-bonded GA pairs. The environments around helix ends in oligoribonucleotides and large RNAs are quite different. Thus it would be surprising to see such correlations if end effects are substantial.

FIGURE 2: Calculation of thermodynamic properties for a non-self-complementary duplex, 5′ACGAGC3′/3′UGCUCG5′, using parameters of the INN-HB model. Note that non-self-complementary nearest neighbors have two equivalent orientations, for example, 5′AC3′/3′UG5′ = 5′GU3′/3′CA5′, while self-complementary nearest neighbors have only one orientation, for example, 5′CG3′/3′GC5′. The term 5′A/3′U represents the terminal AU term. TM is calculated from the predicted ΔH° and ΔS° values using eq 3 with a total strand concentration of 0.2 mM. Note that the ΔH° value must be multiplied by 1000 to match the units for ΔS° and R, the gas constant.

Application of INN-HB Model. For nucleic acids, most thermodynamic studies of oligonucleotide duplexes are done in 1 M NaCl. Although this is not a physiological salt concentration, duplex stabilities are similar to those observed in the presence of about 0.1 M Na+ or K+ and a few millimolar Mg2+, which is a physiological condition (30, 34, 84, 89). For 1:1 electrolytes such as NaCl, Manning's polyelectrolyte theory predicts that, due to counterion condensation, the local concentration of monovalent cation near a polynucleotide duplex is around 1 M regardless of the bulk salt concentration (90). For oligonucleotide duplexes, the local cation concentration will depend on duplex length. This length effect, however, is negligible at high salt concentration (39, 91). Therefore, thermodynamic information derived at 1 M NaCl is useful for understanding RNA properties in biologically relevant environments.

To use parameters of the INN-HB model to predict thermodynamic properties for an RNA sequence by eq 14 and analogous equations for ΔH° and ΔS°, the sequence must be represented in terms of these parameters. Thus all of the nearest neighbors, terminal base pairs, and symmetry of the sequence must be included. As shown in Table 4, self-complementary sequences require nonzero symmetry terms for calculating ΔG°37 and ΔS°, but not ΔH°. For each terminal AU base pair, the terminal AU term is added. There is no end correction term for terminal GC base pairs. Sample calculations of thermodynamic parameters for duplex formation for a non-self-complementary and a self-complementary duplex are illustrated in Figures 2 and 3, respectively.
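The bookkeeping described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' software. The nearest-neighbor, initiation, and symmetry values for ΔG°37 are entered as assumed values intended to match Table 4 and should be checked against that table before use; only the 0.45 kcal/mol terminal AU term is stated in the text. The melting temperature helper assumes that eq 3 has the usual two-state form, TM = ΔH°/(ΔS° + R ln(CT/a)), with a = 1 for self-complementary and a = 4 for non-self-complementary duplexes. All function and variable names are hypothetical.

```python
import math

COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}

# Assumed INN-HB free-energy increments at 37 C (kcal/mol), keyed by the
# 5'->3' dinucleotide on one strand. Verify every value against Table 4.
NN_DG37 = {
    "AA": -0.93, "AU": -1.10, "UA": -1.33, "CU": -2.08, "CA": -2.11,
    "GU": -2.24, "GA": -2.35, "CG": -2.36, "GG": -3.26, "GC": -3.42,
}
INITIATION_DG37 = 4.09   # assumed value; see Table 4
TERMINAL_AU_DG37 = 0.45  # per terminal AU pair, as stated in the text
SYMMETRY_DG37 = 0.43     # assumed value; self-complementary duplexes only
R = 1.987                # gas constant in cal/(K mol)


def reverse_complement(seq):
    """Return the pairing strand, also written 5'->3'."""
    return "".join(COMPLEMENT[b] for b in reversed(seq))


def nn_dg37(step):
    """Look up a nearest neighbor in either of its two equivalent orientations."""
    if step in NN_DG37:
        return NN_DG37[step]
    return NN_DG37[reverse_complement(step)]


def predict_dg37(top):
    """Sum initiation, nearest-neighbor, terminal AU, and symmetry terms."""
    dg = INITIATION_DG37
    for i in range(len(top) - 1):
        dg += nn_dg37(top[i:i + 2])
    for terminal_base in (top[0], top[-1]):
        if terminal_base in "AU":  # terminal AU pair; GC ends get no correction
            dg += TERMINAL_AU_DG37
    if top == reverse_complement(top):  # self-complementary duplex
        dg += SYMMETRY_DG37
    return dg


def melting_temperature(dh_kcal, ds_eu, total_strand_conc, self_complementary):
    """TM in Celsius from dH (kcal/mol), dS (cal/(K mol)), and CT (mol/L)."""
    a = 1.0 if self_complementary else 4.0
    # dH is multiplied by 1000 so its units match those of dS and R.
    tm_kelvin = 1000.0 * dh_kcal / (ds_eu + R * math.log(total_strand_conc / a))
    return tm_kelvin - 273.15
```

Calling `predict_dg37("ACGAGC")` sums initiation, five nearest neighbors, and one terminal AU term, whereas `predict_dg37("UGGCCA")` also picks up a second terminal AU term and the symmetry term. The same summation applies to ΔH° and ΔS° once the corresponding Table 4 columns are supplied, and the results can then be passed to `melting_temperature`.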

FIGURE 3: Calculation of thermodynamic properties for a self-complementary duplex, 5′UGGCCA3′/3′ACCGGU5′, using parameters of the INN-HB model. Note that the symmetry term is nonzero for ΔG° and ΔS°. Note that non-self-complementary nearest neighbors have two equivalent orientations, for example, 5′UG3′/3′AC5′ = 5′CA3′/3′GU5′, while self-complementary nearest neighbors have only one orientation, for example, 5′GC3′/3′CG5′. The term A3′/U5′ represents the terminal AU term. TM is calculated from the predicted ΔH° and ΔS° values using eq 3 with a total strand concentration of 0.1 mM. Note that the ΔH° value must be multiplied by 1000 to match the units for ΔS° and R, the gas constant.

Results of calculations based on Table 4 are presented in Table 1 for all sequences studied. Uncertainties in the predicted values can be calculated using the variance-covariance matrixes listed in Supporting Information to estimate the variances of predicted values for ΔG°37, ΔH°, and ΔS°. Then eq 12 or 13 can provide an estimate of uncertainty in TM.

Application to RNA Secondary Structure Prediction. The free-energy increments for helix propagation in Table 4 differ from those of Freier et al. (9) by at most 0.4 kcal/mol, partly attributable to the enlarged database and partly due to implementation of the INN-HB model. Differences due to the INN-HB model are already largely included in algorithms for prediction of RNA secondary structure since hairpins and internal loops are penalized for AU closing base pairs (92-94). Parameters for coaxial stacking, however, were determined by comparing stabilities of duplexes formed adjacent to pre-existing hairpin stems with stabilities predicted for isolated duplexes (95). The results presented here suggest that the stabilities of the isolated duplexes were overestimated by about 0.5 kcal/mol when there is a terminal AU pair in the duplex binding at the helix/helix interface. Thus the free-energy contribution calculated for coaxial stacking in these cases should be made more favorable by about 0.5 kcal/mol. With this correction, there is an enhancement of stability in ΔG°37 in the range of 0.8-1.5 kcal/mol from coaxial stacking compared to the corresponding Watson-Crick nearest-neighbor interactions. Because the number of hydrogen bonds is the same for Watson-Crick nearest neighbors in a continuous helix and a coaxial stack, this enhancement is presumably because the break in the backbone on one strand offers more flexibility to the base pairs at the interface, thus maximizing stacking interactions. Similarly, in multibranch loops, helixes that end with an AU pair at the junction should be penalized by about 0.5 kcal/mol. Modifications of parameters based on the work reported here can be easily incorporated into existing structure prediction algorithms (94, 96, 97).

ACKNOWLEDGMENT

We thank Prof. Donald Gray for sharing his manuscripts before publication. Prof. Gray, Dr. S. M. Freier, and David H. Mathews provided stimulating discussions and comments on the manuscript. We also thank D. Leffel, T. W. Barnes, and X. Chen for providing thermodynamic measurements.

SUPPORTING INFORMATION AVAILABLE

One table showing thermodynamic parameters for duplex formation derived from 1/TM vs log(CT/a) and from the average of curve fittings for 99 two-state sequences and 10 non-two-state sequences, one table showing the nearest-neighbor parameters for INN and INN-24 models, a discussion of multiple linear regression analysis by singular-value decomposition, and three tables of variance-covariance matrixes for parameters of the INN-HB model (17 pages).

REFERENCES

1. Turner, D. H., Sugimoto, N., and Freier, S. M. (1988) Annu. Rev. Biophys. Biophys. Chem. 17, 167-192.
2. Michel, F., and Westhof, E. (1990) J. Mol. Biol. 216, 585-610.
3. Sugimoto, N., Kierzek, R., and Turner, D. H. (1988) Biochemistry 27, 6384-6392.
4. Bevilacqua, P. C., and Turner, D. H. (1991) Biochemistry 30, 10632-10640.
5. Pyle, A. M., and Cech, T. R. (1991) Nature 350, 628-631.
6. Li, Y., Bevilacqua, P. C., Mathews, D. H., and Turner, D. H. (1995) Biochemistry 34, 14394-14399.
7. Gluick, T. C., and Draper, D. (1994) J. Mol. Biol. 241, 246-262.
8. Borer, P. N., Dengler, B., Tinoco, I., Jr., and Uhlenbeck, O. C. (1974) J. Mol. Biol. 86, 843-853.
9. Freier, S. M., Kierzek, R., Jaeger, J. A., Sugimoto, N., Caruthers, M. H., Neilson, T., and Turner, D. H. (1986) Proc. Natl. Acad. Sci. U.S.A. 83, 9373-9377.
10. Gray, D. (1997) Biopolymers 42, 783-793.
11. Gray, D. (1997) Biopolymers 42, 795-810.
12. Kierzek, R., Caruthers, M. H., Longfellow, C. E., Swinton, D., Turner, D. H., and Freier, S. M. (1986) Biochemistry 25, 7840-7846.
13. Goldstein, R., and Benight, A. (1992) Biopolymers 32, 1679-1693.
14. Gray, D., and Tinoco, I., Jr. (1970) Biopolymers 9, 223-244.
15. Allawi, H. T., and SantaLucia, J., Jr. (1997) Biochemistry 36, 10581-10594.
16. SantaLucia, J., Jr. (1998) Proc. Natl. Acad. Sci. U.S.A. 95, 1460-1465.
17. Sugimoto, N., Honda, K., and Sasaki, M. (1994) Nucleosides Nucleotides 13, 1311-1317.



18. Sugimoto, N., Nakano, S., Katoh, M., Matsumura, A., Nakamuta, H., Ohmichi, T., Yoneyama, M., and Sasaki, M. (1995) Biochemistry 34, 11211-11216.
19. Beaucage, S. L., and Caruthers, M. H. (1981) Tetrahedron Lett. 22, 1859-1863.
20. McBride, L. J., and Caruthers, M. H. (1983) Tetrahedron Lett. 24, 245-249.
21. Usman, N., Ogilvie, K. K., Jiang, M.-V., and Cedergren, R. (1987) J. Am. Chem. Soc. 109, 7845-7854.
22. Wincott, F., DiRenzo, A., Shaffer, C., Grimm, S., Tracz, D., Workman, C., Sweedler, D., Gonzalez, C., Scaringe, S., and Usman, N. (1995) Nucleic Acids Res. 23, 2677-2684.
23. Stawinski, J., Strömberg, R., Thelin, M., and Westman, E. (1988) Nucleosides Nucleotides 7, 779-782.
24. Chou, S.-H., Flynn, P., and Reid, B. (1989) Biochemistry 28, 2422-2435.
25. Borer, P. N. (1975) in Handbook of Biochemistry and Molecular Biology: Nucleic Acids (Fasman, G. D., Ed.) 3rd ed., Vol. I, p 589, CRC Press, Cleveland, OH.
26. Richards, E. G. (1975) in Handbook of Biochemistry and Molecular Biology: Nucleic Acids (Fasman, G. D., Ed.) 3rd ed., Vol. I, p 197, CRC Press, Cleveland, OH.
27. Peritz, A. E., Kierzek, R., Sugimoto, N., and Turner, D. H. (1991) Biochemistry 30, 6428-6436.
28. Albergo, D. D., Marky, L. A., Breslauer, K. J., and Turner, D. H. (1981) Biochemistry 20, 1409-1413.
29. Petersheim, M., and Turner, D. H. (1983) Biochemistry 22, 256-263.
30. McDowell, J. A., and Turner, D. H. (1996) Biochemistry 35, 14077-14089.
31. Cantor, C. R., and Schimmel, P. R. (1980) Biophysical Chemistry, Part III, W. H. Freeman & Co., San Francisco, CA.
32. SantaLucia, J., Jr. (1991) Ph.D. Thesis, University of Rochester, New York.
33. SantaLucia, J., Jr., Kierzek, R., and Turner, D. H. (1991) J. Am. Chem. Soc. 113, 4313-4322.
34. Xia, T., McDowell, J. A., and Turner, D. H. (1997) Biochemistry 36, 12486-12497.
35. Bevington, P. R., and Robinson, D. K. (1992) Data Reduction and Error Analysis for the Physical Sciences, 2nd ed., McGraw-Hill, New York.
36. Meyer, S. L. (1975) Data Analysis for Scientists and Engineers, Wiley, New York.
37. Krug, R. R., Hunter, W. G., and Grieger, R. A. (1976) J. Phys. Chem. 80, 2335-2341.
38. Snedecor, G. W., and Cochran, W. G. (1982) in Statistical Methods, 7th ed., p 189, The Iowa State University Press, Ames, IA.
39. SantaLucia, J., Jr., and Turner, D. H. (1997) Biopolymers 44, 309-319.
40. Wilson, E. B., Jr. (1952) An Introduction to Scientific Research, McGraw-Hill, New York.
41. Pyle, A. M., Moran, S., Strobel, S. A., Chapman, T., Turner, D. H., and Cech, T. R. (1994) Biochemistry 33, 13856-13863.
42. Narlikar, G. J., Khosla, M., Usman, N., and Herschlag, D. (1997) Biochemistry 36, 2465-2477.
43. Gralla, J., and Crothers, D. M. (1973) J. Mol. Biol. 73, 497-511.
44. Cantor, C. R., and Tinoco, I., Jr. (1965) J. Mol. Biol. 13, 65-77.
45. Cantor, C. R., Jaskunas, S. R., and Tinoco, I., Jr. (1966) J. Mol. Biol. 20, 39-62.
46. Lewis, G. N., and Randall, M. (1961) Thermodynamics, 2nd ed., McGraw-Hill, New York.
47. Finkelstein, A. V., and Janin, J. (1989) Protein Eng. 3, 1-3.
48. Williams, D. H., Cox, J. P. L., Doig, A. J., Gardner, M., Gerhard, U., Kaye, P. T., Lal, A. R., Nicholls, I. A., Salter, C. J., and Mitchell, R. C. (1991) J. Am. Chem. Soc. 113, 7020-7030.
49. Bailey, W. F., and Monahan, A. S. (1978) J. Chem. Educ. 55, 489-493.
50. Seber, G. A. F. (1977) Linear Regression Analysis, John Wiley & Sons, Inc., New York.

51. Neter, J., Wasserman, W., and Kutner, M. H. (1985) Applied Statistical Models, 2nd ed., Richard D. Irwin, Inc., Homewood, IL.
52. Press, W. H., Teukolsky, S. A., Vetterling, W. T., and Flannery, B. P. (1992) Numerical Recipes in C, 2nd ed., Cambridge University Press, New York.
53. Wolfram, S. (1996) MATHEMATICA version 3.0, Wolfram Research, Inc.
54. SantaLucia, J., Jr., Allawi, H. T., and Seneviratne, P. A. (1996) Biochemistry 35, 3555-3562.
55. SAS Institute Inc. (1990) Cary, NC.
56. Freier, S. M., Burger, B. J., Alkema, D., Neilson, T., and Turner, D. H. (1983) Biochemistry 22, 6198-6206.
57. Freier, S. M., Alkema, D., Sinclair, A., Neilson, T., and Turner, D. H. (1985) Biochemistry 24, 4533-4539.
58. Hickey, D. R., and Turner, D. H. (1985) Biochemistry 24, 2086-2094.
59. Turner, D. H., and Bevilacqua, P. C. (1993) in The RNA World (Gesteland, R. F., and Atkins, J. F., Eds.) Cold Spring Harbor Laboratory Press, New York.
60. Turner, D. H. (1999) in Nucleic Acids: Structure, Properties, and Functions (Bloomfield, V. A., Crothers, D. M., and Tinoco, I., Jr., Eds.) University Science Books, Mill Valley, CA.
61. Doktycz, M. J., Morris, M. D., Dormady, S. J., Beattie, K. L., and Jacobson, K. B. (1995) J. Biol. Chem. 270, 8439-8445.
62. Wada, A., Yubuki, S., and Husimi, Y. (1980) Crit. Rev. Biochem. 9, 87-144.
63. Gotoh, O. (1983) Adv. Biophys. 16, 1-52.
64. Wartell, R. M., and Benight, A. S. (1985) Phys. Rep. 126, 67-107.
65. Schmitz, M., and Steger, G. (1992) Comput. Appl. Biosci. 8, 389-399.
66. Bloomfield, V. A., Crothers, D. M., and Tinoco, I., Jr. (1974) Physical Chemistry of Nucleic Acids, Harper and Row, New York.
67. Poland, D., and Scheraga, H. (1970) Theory of Helix-Coil Transitions in Biopolymers, Academic Press, New York.
68. Poland, D. (1978) Cooperative Equilibria in Physical Biochemistry, Clarendon Press, Oxford, U.K.
69. Steger, G. (1994) Nucleic Acids Res. 22, 2760-2768.
70. Watson, J. D., Hopkins, N. H., Roberts, J. W., Steitz, J. A., and Weiner, A. M. (1987) Molecular Biology of the Gene, Benjamin Cummings, Inc., Menlo Park, CA.
71. Lewin, B. (1997) Genes VI, Oxford University Press, Oxford, U.K.
72. Gesteland, R. F., and Atkins, J. F. (1993) The RNA World, Cold Spring Harbor Laboratory Press, New York.
73. Freier, S. M., Sugimoto, N., Sinclair, A., Alkema, D., Neilson, T., Kierzek, R., Caruthers, M. H., and Turner, D. H. (1986) Biochemistry 25, 3214-3219.
74. Turner, D. H., Sugimoto, N., Kierzek, R., and Dreiker, S. D. (1987) J. Am. Chem. Soc. 109, 3783-3785.
75. Martin, F. H., Castro, M. M., Aboul-ela, F., and Tinoco, I., Jr. (1985) Nucleic Acids Res. 13, 8927-8938.
76. Aboul-ela, F., Koh, D., Tinoco, I., Jr., and Martin, F. H. (1985) Nucleic Acids Res. 13, 4811-4824.
77. Kawase, Y., Iwai, S., Inoue, H., Miura, K., and Ohtsuka, E. (1986) Nucleic Acids Res. 14, 7727-7736.
78. Gaffney, B. L., Marky, L. A., and Jones, R. A. (1984) Tetrahedron 40, 3-13.
79. Serra, M., Barnes, T., Betschart, K., Gutierrez, M. J., Sprouse, K. J., Riley, C. K., Stewart, L., and Temel, R. E. (1997) Biochemistry 36, 4844-4851.
80. Wu, M., McDowell, J. A., and Turner, D. H. (1995) Biochemistry 34, 3204-3211.
81. SantaLucia, J., Jr., and Turner, D. H. (1993) Biochemistry 32, 12612-12623.
82. Wu, M., and Turner, D. H. (1996) Biochemistry 35, 9677-9689.
83. Wu, M., SantaLucia, J., Jr., and Turner, D. H. (1997) Biochemistry 36, 4449-4460.
84. McDowell, J. A., He, L., Chen, X., and Turner, D. H. (1997) Biochemistry 36, 8030-8038.



85. Wang, S., and Kool, E. T. (1995) Biochemistry 34, 4125-4132.
86. Vologodskii, A. V., Amirikyan, B. R., Lyubchenko, Y. L., and Frank-Kamenetskii, M. D. (1984) J. Biomol. Struct. Dyn. 2, 131-148.
87. Sugimoto, N., Kierzek, R., and Turner, D. H. (1987) Biochemistry 26, 4554-4558.
88. Burkard, M. E., Turner, D. H., and Tinoco, I., Jr. (1998) in The RNA World II (Cech, T. R., Gesteland, R. F., and Atkins, J. F., Eds.) Cold Spring Harbor Laboratory Press, New York (in press).
89. Williams, A. P., Longfellow, C. E., Freier, S. M., Kierzek, R., and Turner, D. H. (1989) Biochemistry 28, 4283-4291.
90. Manning, G. (1978) Q. Rev. Biophys. 11, 179-246.
91. Olmsted, M. C., Anderson, C. F., and Record, M. T., Jr. (1989) Proc. Natl. Acad. Sci. U.S.A. 86, 7766-7770.
92. Walter, A. E., Turner, D. H., Kim, J., Lyttle, M. H., Müller, P., Mathews, D. H., and Zuker, M. (1994) Proc. Natl. Acad. Sci. U.S.A. 91, 9218-9222.
93. Serra, M., and Turner, D. H. (1995) Methods Enzymol. 259, 242-261.
94. Mathews, D. H., Andre, T. C., Kim, J., Turner, D. H., and Zuker, M. (1998) in Molecular Modeling of Nucleic Acids (Leontis, N. B., SantaLucia, J., Eds.) ACS Symposium Series 682, American Chemical Society, Washington, D.C., pp 246-257.



95. Walter, A. E., and Turner, D. H. (1994) Biochemistry 33, 12715-12719.
96. Zuker, M., and Stiegler, P. (1981) Nucleic Acids Res. 9, 133-148.
97. Zuker, M. (1989) Science 244, 48-52.
98. Freier, S. M., Sinclair, A., Neilson, T., and Turner, D. H. (1985) J. Mol. Biol. 185, 645-647.
99. Hall, K., and McLaughlin, L. W. (1991) Biochemistry 30, 10606-10613.
100. He, L., Kierzek, R., SantaLucia, J., Jr., Walter, A. E., and Turner, D. H. (1991) Biochemistry 30, 11124-11132.
101. Longfellow, C. E., Kierzek, R., and Turner, D. H. (1990) Biochemistry 29, 278-285.
102. Sugimoto, N., Kierzek, R., Freier, S. M., and Turner, D. H. (1986) Biochemistry 25, 5755-5759.
103. Nelson, J. W., Martin, F. H., and Tinoco, I., Jr. (1981) Biopolymers 20, 2509-2531.
104. Freier, S. M., Petersheim, M., Hickey, D. R., and Turner, D. H. (1984) J. Biomol. Struct. Dyn. 1, 1229-1242.
105. Walter, A. E., Wu, M., and Turner, D. H. (1994) Biochemistry 33, 11349-11354.

BI9809425
