
BioSystems, 29 (1993) 105-128

Elsevier Scientific Publishers Ireland Ltd.

A linguistic representation of the regulation of transcription initiation. II. Distinctive features of sigma 70 promoters and their regulatory binding sites

Julio Collado-Vides
Centro de Investigación sobre Fijación de Nitrógeno, Universidad Nacional Autónoma de México, Cuernavaca A.P. 565-A,
Morelos 62271 (México)

(Received December 5th, 1992)

The goal of this paper and the accompanying one is to achieve a linguistic representation of a set of sigma 70 promoters.
Such a description is formed by an ordered concatenated array of complex symbols identified by their categorical property,
i.e. promoter, operator, activator binding site, etc. Each of these symbols may contain several properties associated with their
respective classes of 'molecular words'. The main problem in attaining such a description is to define which properties are going
to be represented, and how. In the accompanying paper the criteria on which the selection of alternative descriptions is based
were discussed. The properties of promoters and regulatory sites are discussed here, and their corresponding distinctive features
are selected following such criteria. Thus, information that is not directly relevant and that can overspecify the description
has been excluded, since it does not seem to contribute to identifying classes of substitutable elements. Other properties,
such as strength of promoters, position of regulatory sites, different types of specificities of regulatory proteins, affinity of their
binding sites, etc., are also discussed. As a result of this analysis, a complete representation with distinctive features of the
set of sigma 70 promoters is attainable.

Key words: Linguistic formalisation; Gene regulation; Bacterial promoters
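
As an informal illustration of the kind of description outlined above (an ordered array of complex symbols, each carrying a categorical property and a set of features), the following Python sketch may help. It is not part of the original formalism: the class name `Symbol`, the helpers `same_class` and `render`, the feature names, and the example values are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Symbol:
    """A complex symbol: a categorical property plus distinctive features."""
    category: str                                   # e.g. "Pr", "Op" or "I"
    features: Dict[str, str] = field(default_factory=dict)


def same_class(a: Symbol, b: Symbol) -> bool:
    """Treat two symbols as substitutable when category and features agree."""
    return a.category == b.category and a.features == b.features


def render(array: List[Symbol]) -> str:
    """Write an ordered array of symbols as one bracketed string."""
    parts = []
    for s in array:
        feats = ", ".join(f"{k}={v}" for k, v in sorted(s.features.items()))
        parts.append(f"{s.category}[{feats}]")
    return " ".join(parts)


# Hypothetical lac-like array; ordering and feature values are illustrative only.
ugi = [
    Symbol("I", {"binds": "CRP"}),
    Symbol("Pr", {"strength": "-"}),
    Symbol("Op", {"binds": "LacI", "signal": "allolactose"}),
]

print(render(ugi))
```

Under this reading, two promoter sequences that receive the same feature values would fall into one class of substitutable elements, whatever their exact nucleotide sequences.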

Correspondence to: Julio Collado-Vides, Centro de Investigación sobre Fijación de Nitrógeno, Universidad Nacional Autónoma de México, Cuernavaca A.P. 565-A, Morelos 62271, México.

1. Introduction

This and the accompanying paper (Part I) contain a linguistic analysis that is required for the further development of a grammar of sigma 70 promoters regulated at the initiation of transcription (Collado-Vides, 1992). We refer to operons, transcription units, promoters and structural genes and their closely associated regulatory sites as units of genetic information (UGIs).

The question of constructing grammatical models or grammars illustrates, on a smaller scale, the problem of integration of information in molecular biology. Certainly the catalogue of promoters raised the question of how to organize such a collection of different arrays under a common set of principles. The question is how similar types of data sets can be used to construct a linguistic formalisation where all the pertinent properties of gene regulation are made explicit.

The linguistic approach enables us to address in a systematic way the question of how the huge amount of already deciphered information can be integrated in order to enhance our biological understanding. These two papers are devoted to defining a transcript, or representation of UGIs, into elements that are pertinent to the functioning of the sigma 70 reading apparatus. We call it a linguistic L1 level of representation of UGIs. The identification of such representation consists in selecting which information is pertinent to the regulatory description of UGIs, and which is the best way to describe it.

The purpose of this paper is to identify the set of distinctive features of promoters (Pr), operators (Op) and activator binding sites (I) of sigma 70 promoters. This identification is based on the criteria stipulated in the previous paper. As mentioned before, the set of distinctive features of molecular categories is going to include properties of the nucleotide sequences of DNA as well as properties of the respective proteins.

2. Distinctive features of the Pr category

In the previous paper we determined the way information on co-ordinates of sites is included as a distinctive feature. In this section we will discuss the way several properties of promoter categories are incorporated as distinctive features. We will discuss differences in sigma factor specificity, and will analyse whether notions such as lac, narG and malE (see (10), (11) and (13) in the accompanying paper) that identify specific promoters can qualify as distinctive features. Other features to be discussed are those used in the classification of sigma 70 promoters in Collado-Vides et al., 1991.

The L1 representation can be understood as the theoretical transcription of the properties used in the 'interpretation' of sequences of DNA assigned by 'reading systems'. These 'reading systems' correspond to different RNA polymerases with their associated set of regulatory proteins and their respective interactions. Different sigma factors provide a different specificity to the bacterial RNA core polymerase for recognizing different DNA sequences as their respective promoters (Helmann and Chamberlin, 1988). There is no question that information that specifies which RNA polymerase is involved in the transcription of a UGI is relevant to its description, since all the promoters of the collection are identified as sigma 70 promoters.

The question is, where to locate this information? There are two alternatives: either it is a distinctive feature of promoter categories or it is included within the general principles that characterize sets of UGIs or 'languages of regulation'. The latter proposal assumes separate grammars for each set of UGIs of the different 'reading systems'. Compare, for instance, the set of sigma 70 UGIs with that of sigma 54 UGIs. They have quite different properties in terms of the mechanisms and distribution of regulatory sites, and the assumptions on the conditions for their regulation are also different (Collado-Vides et al., 1991). The search for different grammars for each set seems a more adequate approach than trying to integrate all the principles and restrictions of occurrence of sigma 70 and sigma 54 arrays into a single grammar. Thus, considering that we are investigating the grammar of sigma 70 promoters, there is no need to repeat each time that promoters are sigma 70 promoters.

However, the striking differences between sigma 70 and sigma 54 may not be the general rule across different bacterial factors (Helmann and Chamberlin, 1988), and if it is found convenient to group into a single grammar UGIs with different sigma factor specificity, then such property would have to be indicated in each promoter.

Incidentally, observe that the notion of 'language' as the collection of all possible UGIs grouped into a single theory or grammar is more akin to different molecular systems of transcription and regulation than to sets of UGIs specific to particular species. Based on the similarity of the transcription and regulation machinery, sigma 70 UGIs of E. coli and Salmonella are considered under similar principles. And, in a similar way, sigma 54 UGIs of E. coli, Salmonella and Klebsiella can be studied as part of a single 'language' (Collado-Vides et al., 1991). Thus, the species identification is not required within the set of sigma 70 or within the set of sigma 54 promoters. Elimination of species identification assumes that either there are no structural characteristics within the sequences of promoters that would enable one to separate the promoters of different species within a sigma class or that if there are, they do not interfere with the substitutability of promoters.

Let us return now to analyse whether 'lac', 'araC' or 'bioB' are distinctive features (see,
e.g., (10) in the accompanying paper). These names identify specific promoters; in other words, such symbols are shorthand representations of complete nucleotide sequences. Their use in a complete representation seems to contradict the aim of defining the minimal properties required to distinguish them. Certainly one can expect that a smaller number of nucleotides could suffice to distinguish any two promoters of the collection of sigma 70 promoters. In addition, for the sake of regulation, such specific sequences are accidental, in the sense that there are other sequences that can replace them without significantly modifying the regulation of the respective UGI.

We have then to widen the notion of promoter 'X' in order to include the set of indistinguishable sequences into a single distinctive feature. For instance, a different notion of promoter X, let us denote it by 'promoter X', can be defined in such a way as to include all the sequences that have the same -10 and -35 sequences and the same distance in between them as the native X promoter. As a first approximation, any sequence that satisfies these requirements can be substituted for the native promoter X. However, even this extended notion of 'promoter' is too specific to be considered a distinctive feature. If promoters X and Y do not differ significantly, there is no need to associate with them a different distinctive feature.

All the promoter sequences that are indistinguishable in terms of regulation have to be grouped into a single class. Such classes do not depend exclusively on the conservation of those specific nucleotides that enable us to distinguish between 'promoters'. For instance, sigma 70 E. coli promoters with different -35 domains can have the same regulatory behaviour (see promoters 14, 15 and 18 in Tsung et al., 1990). Conservation of promoter strength can be assumed to be more important than conservation of specific sequences in the -10 and -35 domains.

There is no question that the distinction at least between strong and weak promoters is relevant for purposes of a regulatory description. Negatively regulated promoters are usually strong promoters, whereas positively regulated promoters are usually weak promoters (De Crombrugghe et al., 1982). An extreme illustration of this correlation is the example of an activator that can become a repressor by the exclusive modification of the strength of the promoter (Tsung et al., 1990). The properties of strong (+) and weak (-) are closer to the notion of a distinctive feature than the classes of 'promoters' previously defined.

Promoter strength correlates with the sequence of the promoter. Most of the mutations that make a sequence closer to the consensus are up mutations, and most of those that make the sequence less similar to the consensus are down mutations (Hawley and McClure, 1983). Thus, modifications either of the distance between the conserved domains, or of a nucleotide in either the -10 or the -35 domain can give equivalent results in terms of promoter strength. Partitioning of promoters into different classes according to their strength does not require identifying promoters in terms of the specific sequences in the conserved -35 and -10 domains.

Observe that the 'strength' of promoters can be decomposed into its two components, the equilibrium dissociation constant Kd, and the kinetic constant of transition to open complex (McClure, 1985). Features of strength of interaction can be decomposed into simpler parameters until each mechanism of regulation is described at the level of elementary chemical steps (Eyring and Eyring, 1963). However, the condition of simplicity of representation imposes a limit to what is identified as a relevant feature, preventing the overspecification of features.

The incorporation of these two parameters within the set of distinctive features is justified if they help to distinguish different classes of promoters, where such classes are useful in a description that emphasizes regulatory properties of UGIs.

A correlation has been found between consensus in the -10 domain and binding of RNA polymerase, and consensus in the -35 domain and isomerization kinetics (Studnicka, 1988). But if these classes do not correlate with positive and negative regulation, there is no motivation to
consider them as distinctive features. Recall that activatable promoters can have a perfect consensus -35 sequence (Tsung et al., 1990). Based on the data and analyses available, there are no reasons to consider the binding constant and the isomerization constant as two separate features. See also the discussion of affinity of protein-DNA interaction in Section 3.2.

The number of classes of affinity for a certain type of promoter should ideally reflect the number of biologically significant classes. For the moment we assume two values, (+) for strong and (-) for weak promoters, to be assigned to the distinctive feature of 'affinity' or 'strength' of promoters.

Strength is not the only feature required for a satisfactory description of promoters. There are other properties, which do not necessarily depend on the internal structure of the sequences involved in the binding of RNA polymerase and which might need to be taken into account. There is evidence indicating that downstream sequences close to the promoters can play a role in the efficiency of termination of transcription of UGIs (Telesnitsky and Chamberlin, 1989). When considering not only transcription initiation but regulation of the UGI at different levels, a 'termination' feature with two or more values indicating the different conformations of RNA polymerase should be included in more complete grammatical models.

In the review of the database of sigma 70 UGIs it was useful to distinguish between 'multiple', 'complex' and 'simple' promoters. Promoters are called 'multiple' if their regulatory region is connected to that of another promoter and both are controlled by at least one common element or transcribe the same gene. Promoters that are either multiple or subject to several systems of regulation are called 'complex'. Finally, those that are not complex are called simple promoters (Collado-Vides et al., 1991). The property 'multiple' with a (+) feature for multiple promoters and a (-) feature for non-multiple promoters could be proposed.

However, observe that these properties are completely different from strength and termination properties, in the sense that they do not impose restrictions on the substitutability of promoters' sequences under the test of regulatability. There is no reason to expect that the nucleotide sequence of a simple or multiple promoter would fail to replace another promoter because of these properties. Recall that the dictionary in the grammar contains the complete sequence of promoters, irrespective of overlapping domains with other categories.

Alternatively, the character of simple, multiple or complex promoters could be considered as a property of complete UGIs and not of individual Pr categories. Thus a UGI can be defined as simple if it has a single promoter regulated by a single system of regulation. A multiple UGI is one with several promoters, and a UGI is complex either when it is multiple or when its regulation involves several systems of regulation. The location of these properties as properties of complete UGIs is based on the assumption, previously mentioned, that there is no information in the sequence of promoters that is associated with such properties. Furthermore, these properties correlate with regulatory properties of UGIs. It has been observed that duplicated remote sites generally occur in arrays with multiple promoters subject to multiple systems of regulation (Collado-Vides et al., 1991).

In a similar way, supercoiling of the template cannot usually be associated with one particular molecular category. Together with the character of multiple or simple UGIs, it has to be incorporated at the level of complete UGIs. Certainly it has been observed that mechanisms of regulation that involve binding sites of regulatory proteins located at a distance frequently work better with supercoiled DNA than with linear templates (Borowiec et al., 1987; Tobin and Schleif, 1990; Richet and Raibaud, 1991).

The last feature of promoters that needs to be discussed again is that of information on co-ordinates. Briefly, recalling what has been discussed in the previous paper, the reasons not to include co-ordinates in promoters are the following: (i) the precise location of all regulatory sites can be deduced from the distance of remote sites and co-ordinates of proximal sites, and (ii) the information on the boundaries of -65
to +20 of the promoter is already used in the principle of proximal precedence of sigma 70 UGIs.

In summary, specific names such as 'lac' and 'bioB' (or their complete nucleotide sequences) do not satisfy the conditions to be considered distinctive features of promoters. A strength or affinity property with two or more values for different degrees of strength is a distinctive property of promoters. Other properties may also be considered in more detailed descriptions of promoters, such as a 'termination' property. Properties of 'multiple' and 'subject to multiple regulation' are not included as features of promoters, assuming they do not affect the substitutability of promoters. However, they might well be located as properties of complete UGIs, in the same way as the degree of supercoiling of the DNA template. Finally, the co-ordinates of individual promoters are not included as distinctive features. The number of values associated with each feature, as mentioned before, should reflect the number of classes that are biologically significant. We will assume two classes of affinity, strong and weak promoters.

In the following we will address the question of which are the distinctive features of regulatory sites with information on specificity of binding of regulatory proteins.

3. Distinctive features of regulatory sites

Operator and activator categories are the sites where regulatory proteins bind to DNA and then regulate gene expression by either preventing or helping RNA polymerase to clear the promoter and proceed into elongation of transcription. In this section we will select the set of properties of the Op and I categories that can qualify as distinctive features. These properties will include those that correspond to the Op/I DNA target sites, those of their respective binding proteins, and those describing their interaction. The set of distinctive features is associated, in the dictionary, with the DNA sequences of these sites.

As a first approximation, one could think that the set of these sites is divided into different molecular words, each one describing a site of interaction with one protein. More precisely, each DNA-protein specificity would correspond to a group of molecular words, those that share that specificity. However, since proteins contain several other properties, this specificity is only one among other properties that identify groups of words. To name a protein means to refer to a considerable number of properties that, to a certain extent, can be modified independently. For instance, when referring to a protein we are referring to at least three different separable domains of specificity: the domain of protein-DNA interaction; the domain of recognition of the allosteric signal, or that of the equivalent covalent modification; and the domain that participates in the formation of dimers and other multimers with homologous proteins. Finally, in the case of activators, as well as in some repressors (Menon and Lee, 1990), another site may be involved in the interaction with RNA polymerase. As mentioned before, the L1 representation of 'deciphered' DNA sequences has to include properties of regulatory proteins (Collado-Vides, 1993, in press).

In the following section we will discuss each of the properties that are important in the description of these regulatory sites.

3.1. Specificity of protein-DNA interaction

In representations like that of (10) of the previous paper, we used 'binds CRP' and 'binds LacI' as features in proximal regulatory sites. Following a reasoning similar to the discussion on promoter identification, we can argue that the description of the DNA binding domain of the proteins can be reduced to only the elements that are pertinent for the DNA-protein recognition. Thus, for instance, the recognition helices of GalR and LacI differ only by three amino acids (Lehming et al., 1987). The description of such domains could therefore be reduced to a matrix description indicating the positions with the amino acids that are different. A simplified description for protein-DNA specificity depends on the discovery of general rules for the different types of protein domains for such interactions. This is, however, a topic in itself as complex as
the search for general rules in the description of UGIs (Lehming et al., 1990).

Since the dictionary contains the complete sequence of the DNA site, a 'protein-DNA' feature bearing the specificity of that interaction only requires identifying the protein domain. Such a domain can be identified by the name of the protein. It should be clear, nonetheless, that conceptually, such a name corresponds to the minimal set of amino acids used within a set of principles that better describe the pertinent features of protein-DNA interaction.

3.2. The affinity of the protein-DNA interaction

Depending on the specific nucleotide sequence, the strength or affinity of the binding of a repressor protein to a DNA binding site can vary by more than three orders of magnitude. Substitutions within the class of all the sequences that recognize a single repressor are expected to affect regulation considerably.

For instance, the dissociation constant, Kd, for LexA binding to recA is 2 nM, whereas the Kd for binding to uvrB, lexA and clel is 20, 20 and 0.04 nM, respectively (Walker, 1987). Similarly, an ideal lac operator binds the lac repressor eight times more strongly than the E. coli proximal operator sequence (Simons et al., 1984). Changes in the affinity of the lac operator modify the promoter activity accordingly (Jobe et al., 1974). A correlation between affinity and deviation from homology has been observed in a collection of CRP-binding sites (Berg and von Hippel, 1988). Variations in the affinity of repressors may establish a continuum of degree of efficiency of repression, from constitutive to normally regulated to excessively repressed UGIs.

It is important to note that the range of affinity we are talking about refers to specific binding of regulatory proteins to their DNA targets. Sequences of DNA that can bind unspecifically to proteins are ruled out from the grammatical model by simply excluding them from the dictionary. Similarly, putative binding sites for which there is not sufficient evidence are also excluded (see types of evidence in Collado-Vides et al., 1991).

The fact that large differences exist in the binding, for instance, of LexA to sites that occur in nature indicates the importance of distinguishing weak and strong sites. Thus, several classes of affinity have to be considered. Remote LacI sites are weak binding sites, whereas the primary proximal site is the one with the highest affinity (Gralla, 1989).

Finally, recall that the binding of a regulator depends on the sequence of the DNA target site. Within the levels of description so far available in the grammar, there is no other alternative but to identify the different variations, or mutant operator sequences for a specific regulator, as different molecular 'words'. Each sequence of a set that binds the same target will only differ in its affinity feature. In fact, it may well be that more than one sequence determines the same affinity feature.

In summary, these arguments support the 'affinity' as a distinctive feature of regulatory sites. A feature with two values, high (+) and low (-) affinity, or with more values, can be used.

The affinity or strength of the protein-DNA interaction can be described at different levels, either at the level of an 'intrinsic affinity' or 'strength' parameter or using the equilibrium constant of dissociation, Kd, of the protein-DNA complex and the respective kinetic parameters of association and dissociation; or, even more, the elementary steps of the mechanism of reaction could be used. Recall, however, that we are identifying the level of description of this information that better corresponds to what we defined as a distinctive feature.

Consider, for instance, activator mechanisms. The affinity of RNA polymerase binding or the kinetic transition to open complex formation and subsequent elongation can be modified (McClure, 1985); increased escaping from elongation is also documented (Menendez et al., 1987).

A detailed description of these cases involves several chemical parameters within the Pr category, i.e. the equilibrium constant of promoter-RNA polymerase binding, the kinetic constant of the transition of closed to open complex, and kinetic constants of the transition to elongation. In order to conserve a homogeneous level of description throughout the grammar, all cases should include these parameters.

The importance of such information for the understanding of regulatory mechanisms is not being questioned. What we need to determine is which information contributes to distinguishing among different classes of substitutable promoters. These classes would be justified if different classes of activators occurring with the respective classes of promoters were found. However, even if there are no studies on a large number of proteins, evidence available indicates that one cannot distinguish between 'the activator protein that enhances binding of polymerase' and 'the activator that enhances the transition to open complex' as distinct classes of proteins. Certainly the mechanistic properties of activators depend not only on the protein but also on the specific position of the site (Gaston et al., 1990). In fact the same regulatory protein can exert its activating effect by different mechanisms (Igarashi et al., 1991; Richet et al., 1991; Shevell and Walker, 1991). Similarly, the repressibility of operator sites depends not only on the intrinsic affinity for the repressor but also on its precise location in relation to the promoter (Lanzer and Bujard, 1988). And this time we cannot easily include the positional component by multiplying the number of entries in the dictionary, since that would create entries with identical sequences that only differ by their positional information. Such an approach would be counterintuitive to the notion of a grammar as a device that generates the diversity of UGIs with the use of a combinatorial component (Collado-Vides, 1993).

Based on the previous observations, we propose to consider a property that measures the efficiency of the regulatory sites formed by two components. One is the intrinsic affinity, a distinctive feature included in the dictionary characterizing the protein-DNA interaction. The other property, which can be considered a sort of 'correction' of the intrinsic affinity, depends on the precise position of the sequence, and therefore is not included in the dictionary. The theoretical reconstruction within a grammatical model of these properties of regulatory sites is discussed in the grammatical model presented in Collado-Vides, 1992.

3.3. The specificity of the sensor metabolite

Regulation of gene expression is dependent on different physiological conditions that are sensed by regulatory systems by means of an interaction of small metabolites or chemical modification of the regulatory proteins. Repression and derepression of the lac system in E. coli is dependent on physiological conditions that are reflected in the intracellular concentration of allolactose, the natural inducer that is recognized by the LacI repressor. This allosteric site can be modified without changing the repressor-DNA interaction (Bourgeois and Jobe, 1970), consequently affecting regulation.

Considering the importance of these signal metabolites in regulation, it is intuitively attractive to include them as distinctive features. For the sake of a first linguistic description, a 'signal feature' with the name of the allosteric metabolite or the covalent modification as its value would suffice. Thus, LacI would have a signal feature whose value is 'allolactose'. This value indicates the specificity of the regulatory protein. An additional feature is required if we want to distinguish the different protein conformations associated with the binding and unbinding of such signal molecules. It is important to note that by specificity of the sensor metabolite we refer to the allosteric site of binding of an inducer, or binding of a corepressor or to the chemical modification associated with a conformational change of the regulatory protein.

The allosteric specificity is explicitly used in an extended grammatical model that assigns different representations to alternative regulatory states of UGIs. Such an extended model proposes two basic representations for each UGI (Collado-Vides, 1989a). What is called the G representation (for 'genome'), equivalent to the so-called first linguistic representation, is obtained by grammatical rules like those illustrated in Collado-Vides (1992, diagrams 2-5). Subsequent E representations (for
'expression') can then be obtained that reflect the alternative regulatory states of UGIs.

The transition from G to E representations is achieved by transformational rules. These rules modify the location of the regulatory protein and its associated allosteric metabolite in the linguistic representation of a UGI. In this way a simplified description of the binding and unbinding of regulatory proteins to their target sites was obtained. In this extended model the sensor metabolite was treated as a feature of the protein when bound to it and as a weak lexical element when represented free in solution (Collado-Vides, 1989).

The incorporation of an allosteric feature within the set of distinctive features of regulatory sites will facilitate the establishing of a link between the grammar of sigma 70 as conceived here and an extended model that describes alternative states of regulation by means of transformational rules.

3.4. The domain of homologous protein-protein interactions

The specificity of domains of regulatory proteins determines the interactions among homologous regulatory proteins that participate in systems of regulation with multiple sites. These interactions can play an important role in repressing mechanisms, for instance involving looping of the intervening DNA. In other cases with duplicated sites more closely located to each other, such protein domains mediate co-operative interactions. Modification of such domains can impair a mechanism of regulation even if the protein-DNA interaction is not altered (Haber and Adhya, 1988; Menon and Lee, 1990). This property is then necessary to achieve a description of a system of regulation.

There is no doubt that properties of this type will have to be analysed in much more detail when considering eukaryotic systems of regulation, where additional distinctions will be required (Ptashne, 1988). In fact one can consider an affinity feature associated with each different specificity property of regulatory proteins. Although this is reasonable in principle (Bourgeois and Jobe, 1970), there is not a sufficient number of different cases of regulation in sigma 70 to justify such additional affinity feature.

3.5. The stable form of a regulatory protein in solution

The predominant multimeric nature of regulatory proteins in solution varies depending on the protein. Thus, for instance, the lambda repressor is a dimer (Ptashne, 1986), LacI repressor is a tetramer (Riggs et al., 1970), and AraC is a dimer (Wilcox and Meuris, 1976), whereas DeoR is an octamer (Mortensen et al., 1989). In Table I we have collected several properties of the regulatory proteins, including the multimeric nature of the protein, the size of the binding site, and the type of repeat.

The multimeric nature of the protein is important for understanding mechanisms of regulation involving bound DNA between multiple homologous sites. Certainly the looping formation by a protein that binds as a tetramer involves only two specific recognition events, the binding of the protein to one site and the subsequent binding of the other DNA site to the bound protein. In the case of a dimer, the loop formation involves, in principle, three recognition events, the separate binding of a dimer to each DNA site together with the recognition between the two bound proteins.

Observe, however, that a dimer of AraC can form a loop of DNA, which means that a single monomer binds to each DNA site (Lobell and Schleif, 1990). But, unlike DeoR, GalR, or LacI, AraC is an activator whose site is not an inverted repeat. Furthermore, the size of a direct repeat of AraC is exceptionally large: 17 base pairs, compared with the usual 10 bp size of direct repeats of other activators (see Table I). These observations suggest that AraC stoichiometry in looping formation is more an exception than a usual case.

One would expect that the larger multimeric proteins facilitate looping of DNA, allowing for larger distances between the operator sites. Observe that the most distant operator sites of the collection are those involved in looping formation with DeoR, which is an octamer
(Amouyal et al., 1989). Incorporation of the 3.6. Orientation of sites; internal structure:
multimeric stable state of proteins as a distinc- direct and inverted repeats and asymmetric
tive feature depends on whether it proves useful sites
in identifying classes of regulatory proteins with An entry in the dictionary associates a set of
similar restrictions in the occurrence of remote distinctive features with a short sequence of
sites. DNA. This sequence is oriented, indicating

Table I. Multimeric state of regulatory proteins and size of their binding sites

Protein  Multimer  Size (bp)     Orientation   References
Ada      1         12 [a]        Asymmetric    Sakumi and Sekiguchi, 1989; Shevell and Walker, 1991
AraC     2         17 [b]        Direct        Lee et al., 1987; Wilcox and Meuris, 1976
ArgR     6         16            Inverted      Cunin et al., 1983; Lim et al., 1987
BioB     2         40            Inverted      Otsuka and Abelson, 1978
CRP      2         19            Inverted      Busby, 1986
CysB     4         36-46 [a]     [a]           Ostrowski and Kredich, 1991
CytR     n.d. [c]  40 [a]        n.d. [c]      Gerlach et al., 1991; Søgaard-Andersen et al., 1991
DeoR     8         16            Inverted      Amouyal et al., 1989; Mortensen et al., 1989
DnaA     1 [a]     9 [b]         Direct [a]    Wang and Kagun, 1989
FNR      n.d. [c]  22            Inverted      Eiglmeir et al., 1989
Fur      2 [a]     19            Inverted      de Lorenzo et al., 1987
GalR     2         17            Inverted      Majumdar and Adhya, 1987
GlpR     4         20            Inverted      Ye and Larson, 1988
IclR     n.d. [c]  34            Inverted      Cortay et al., 1991
IlvY     n.d. [c]  26            Inverted      Wek and Hatfield, 1988
LacI     4         20            Inverted      Gilbert and Müller-Hill, 1966; Riggs et al., 1970
LexA     2         20            Inverted      Little et al., 1981
MalT     1         10 [b]        Direct        Vidal-Ingigliardi et al., 1991
MetJ     2         8 [b]         Direct        Phillips et al., 1989
MetR     n.d. [c]  24            Inverted [a]  Urbanowski and Stauffer, 1989
NR(I)    2         15            Inverted      Ueno-Nishio et al., 1984; Reitzer and Magasanik, 1983
OmpR     1,2 [a]   10 [b]        Direct        Maeda and Mizuno, 1988, 1990
OxyR     n.d. [c]  >45           - [a]         Tartaglia et al., 1989
PhoB     n.d. [c]  17 [b] [a]    Direct        Makino et al., 1986
PifC     n.d. [c]  20            Inverted      Kennedy et al., 1988
PurR     1-2 [a]   16            Inverted      Rolfes and Zalkin, 1990
PutA     2         27            Inverted      Ostrovsky et al., 1991
RafR     n.d. [c]  18            Inverted      Aslanidis and Schmitt, 1990
RhaR     2         20            Inverted      Tobin and Schleif, 1990
TetR     2         25            Inverted      Takahashi et al., 1986
TrpR     2         27            Inverted      Klig et al., 1988
TyrR     n.d. [c]  22            Inverted      Chye and Pittard, 1987

[a] Not a precisely defined property.
[b] Direct repeat or 'box' size.
[c] n.d., not determined.
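
As a rough illustration of how Table I could be queried, the following sketch encodes a few of its rows as records. The class name `Regulator`, the field names and both helper functions are hypothetical and are not part of the paper. Only rows whose multimer and size values carry no imprecision footnote are included, and the count of recognition events simply restates what the text says about tetramers and dimers.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Regulator:
    """One row of Table I (hypothetical field names)."""
    protein: str
    multimer: Optional[int]  # None where Table I gives n.d.
    site_size_bp: int
    orientation: str         # "direct" or "inverted"


# Subset of Table I, in table order.
TABLE_I_SUBSET: List[Regulator] = [
    Regulator("AraC", 2, 17, "direct"),
    Regulator("ArgR", 6, 16, "inverted"),
    Regulator("CRP", 2, 19, "inverted"),
    Regulator("DeoR", 8, 16, "inverted"),
    Regulator("FNR", None, 22, "inverted"),
    Regulator("GalR", 2, 17, "inverted"),
    Regulator("GlpR", 4, 20, "inverted"),
    Regulator("LacI", 4, 20, "inverted"),
    Regulator("LexA", 2, 20, "inverted"),
    Regulator("MalT", 1, 10, "direct"),
]


def with_multimer_at_least(rows: List[Regulator], n: int) -> List[str]:
    """Names of proteins whose multimeric state is known and is >= n."""
    return [r.protein for r in rows
            if r.multimer is not None and r.multimer >= n]


def loop_recognition_events(multimer: int) -> int:
    """Recognition events for loop formation as counted in the text:
    two for a tetramer, three for a dimer. Other states are not discussed."""
    events = {4: 2, 2: 3}
    if multimer not in events:
        raise ValueError("the text only discusses dimers and tetramers")
    return events[multimer]


if __name__ == "__main__":
    print(with_multimer_at_least(TABLE_I_SUBSET, 4))
    print(loop_recognition_events(4), loop_recognition_events(2))
```

AraC would be an exception to `loop_recognition_events`, since the text notes that a single AraC dimer can form the loop.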
which is the 5' end and which is the 3' end. Certainly some regulatory proteins exert their effect only in a specific orientation in relation to the promoter.

However, in some cases the inversion in the orientation of a complete binding site conserves the regulatory properties of the bound protein. Although mostly common in enhancers, some bacterial sigma 70 and sigma 54 activators are also orientation-independent (De Crombrugghe et al., 1982; Maeda and Mizuno, 1988; Reitzer and Magasanik, 1986). The addition of a feature indicating orientation independence (+) or orientation dependence (-) associated with regulatory sites gives a more economic representation than the alternative solution of considering each oriented sequence as a different molecular 'word' with all the properties repeated.

This orientation feature must not be confused with an arrow feature used in a grammatical derivation that enables a single representation of proximally located multiple promoters. In these representations that can also include divergent promoters, an arrow was used on promoters to indicate the direction of transcription and on Op and I sites to indicate the direction of promoters that are subject to their regulatory effect (Collado-Vides, 1991). These arrow features, however, are not necessary in descriptions analysed here restricted to a single promoter.

Regulatory proteins usually bind as dimers to a single Op or I site. Symmetry properties are therefore common in these DNA sequences, with sites usually having direct or inverted repeats, although cases with no repeating units might also exist (see Table 1). Frequently Op sites are formed by inverted repeats, and activator proteins bind to sites formed by direct repeats, although there are several regulatory proteins that do not follow that correlation, as can be observed in Table 1. In some few cases, direct repeats occur as separated half-sites accompanying a pair of direct repeats that form either an Op or an I site (Collado-Vides, 1993). These properties are not expected to contribute to the identification of classes of substitutable elements. Recall that the grammatical rules will permit a combinatorial description of UGIs based on their regulatory properties, which depend basically on the categorial Op/I identification of sites. The distinction between direct and inverted repeats is a property internal to molecular words, which does not directly determine their substitutability.

It has been proposed that the distinction between direct and inverted repeat may be related to the evolutionary origin of regulatory proteins (Raibaud, 1989). Thus, they may be justified as distinctive features if in future developments of the linguistic theory a component containing evolutionary information is organised.

3.7. Positional restrictions and phase restrictions

Finally, we have to discuss information on position of sites. In the accompanying paper we have already shown that the precise position of proximal sites is a distinctive feature. Here we will discuss alternative ways to represent the position of remote sites.

Recall that remote sites cover the range from -900 for DeoR to +400 for LacI. The positions of duplicated homologous operators and activator sites have been analysed in detail elsewhere (Collado-Vides and Claverie-Martin, unpublished). The analysis supports a general rule in the architecture of arrays of duplicated homologous operators allowing the proteins to bind on the same face of the helix.

A simple way to capture this general rule in the grammatical description is to indicate the relative distance of remote operator sites in relation to the proximal site using the number of helix turn repeats as a measuring unit. Recall, however, that the position of remote sites is indicated always in relation to the proximal reference site. If phasing information is to be used also for position, it is restricted to distance relative to the proximal site. It is possible to consider a set of homologous duplicated sites where not all of them are in phase, particularly in relation to the proximal site. The malK promoter illustrates in fact a case where the proximal site is not in phase, while other sites are in phase (Raibaud et al., 1989).

Thus, the most reasonable description is that
position information of remote sites is indicated by its relative distance to the proximal site, as has been discussed in the previous paper. In addition, sets of sites that co-operate via protein-protein interactions will have a phasing feature with indication of their relative distance measured in helix turns. Proximal or distal sites can have this feature, which gives an explicit identification of those sites supporting either co-operative interactions or DNA looping.

Thus, we conserve the distinction between proximal and remote sites by means of the co-ordinate and distance features, as proposed in the previous paper. Additionally, a phasing feature, indicating the distance measured in helix turns between any two sites, identifies those sites supporting co-operative protein-protein interactions and/or interactions enabling a DNA loop between them.

3.8. Recapitulation of Op and I distinctive features

In summary, we started analysing the specificity of the protein-DNA interaction of regulatory proteins searching for an economic way to describe the property 'binds X'. However, we saw that there are many more distinctive features that naturally belong to Op and I categories. These include three domains of specificity: the specificity of the protein-DNA interaction, the specificity of the sensor metabolite or domain of allosteric or chemical modification of the protein, and the specificity of the domains that determine protein-protein interactions that participate in systems of regulation with multiple sites. The set of features included in a proximal site, such as for instance the CRP binding site in the structure (12) of the previous paper, can be stipulated as:

\[
\left[
\begin{array}{l}
\mathrm{I} \\
c_j = -60 \\
\text{Aff: } (+) \\
\text{sensor: cAMP} \\
m = 2 \\
\text{p-p: CRP dimers} \\
\text{DNA-binding: CRP sites} \\
\text{orientation: } (+)
\end{array}
\right]
\tag{1}
\]

where c indicates the co-ordinates with -60 the position of the central nucleotide. The j subindex is used to differentiate this site from those binding the LacI protein, which have an i sub-index. The size of sites is considered a property associated with the protein, therefore it is not explicitly included in (1). The dictionary contains such 'word' properties associated with the different proteins. Aff, the affinity feature, can be assumed to take two values for high (+) or low (-) affinity. The 'sensor' feature indicates the specificity of the metabolite that induces the allosteric modification responsible for alternative regulatory states of the protein, m is the feature for the multimeric state of the protein bound to DNA, the value of 2 indicating dimers. 'p-p' indicates the protein-protein specificity. In the absence of a better name, 'CRP dimers' is used to refer to the specificity of the domain of CRP proteins that is involved in homologous protein-protein interactions. It is misleading in the sense that a protein groups together three different specificities, of which 'CRP dimers' is only one, as discussed before. The individual identification of these properties reflects the assumption that such specificity determinants can in principle be altered independently.

Finally, an orientation feature indicates that inverted CRP sites can activate transcription. Similar matrices can be used for other proximal sites.

In the case of remote sites, a distance d feature conveys the positional information with the number of bases that separates it from the proximal referential site. The index indicates that these sites share the same specificities as the proximal referential site with the same index. Therefore, these shared features are not included in remote sites. Additionally, proximal and remote sites have a phase feature indicating a distance measured in helix turns when they are involved in co-operative interactions or in DNA looping.

Indication of symmetry properties of sites is excluded, because it does not help to identify classes of sites subject to uniform substitutability restrictions.
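
The feature matrix (1) and the treatment of remote sites can be pictured as a small data structure. The sketch below is only one possible encoding: the class names, field names and the function `expand_remote` are hypothetical, the remote site in the example is invented for illustration, and the choice of which features a remote site inherits through its index (the three specificities plus the multimeric state) is an assumption based on the description above.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class ProximalSite:
    """Feature matrix of a proximal Op or I site, as in (1)."""
    category: str          # "Op" or "I"
    index: str             # co-index, e.g. "j" for CRP sites
    coordinate: float      # c: position of the central nucleotide
    affinity: str          # Aff: "+" (high) or "-" (low)
    sensor: Optional[str]  # metabolite inducing the allosteric change
    multimer: int          # m: multimeric state bound to DNA
    p_p: str               # protein-protein specificity
    dna_binding: str       # protein-DNA specificity
    orientation: str       # "+" independent, "-" dependent


@dataclass(frozen=True)
class RemoteSite:
    """A remote site stores only what it does not share via its index."""
    category: str
    index: str                     # same index as its referential site
    distance: int                  # d: bases from the referential site
    phase: Optional[float] = None  # ph: distance in helix turns


def expand_remote(site: RemoteSite,
                  referents: Dict[str, ProximalSite]) -> Dict[str, object]:
    """Recover the shared features of a remote site from its referent."""
    ref = referents[site.index]
    return {
        "category": site.category,
        "distance": site.distance,
        "phase": site.phase,
        # Assumed to be inherited through co-indexation:
        "sensor": ref.sensor,
        "multimer": ref.multimer,
        "p_p": ref.p_p,
        "dna_binding": ref.dna_binding,
    }


# The CRP site of matrix (1).
crp = ProximalSite("I", "j", -60, "+", "cAMP", 2,
                   "CRP dimers", "CRP sites", "+")

# Hypothetical remote site, not taken from the paper.
remote = RemoteSite("I", "j", distance=-42, phase=4.0)

if __name__ == "__main__":
    print(expand_remote(remote, {"j": crp}))
```

Storing the orientation as a single field on the site, rather than entering each oriented sequence as a separate 'word', mirrors the economy argument made for the orientation feature.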
[Fig. 1 occupies pp. 116-124. The feature matrices of the individual sites are not legible in this copy. The legible Regulator and Promoter entries are listed below; entries between bioA and narG, and between ahpC and aroL, are not legible.]

Regulator      Promoter
Ada            ada, alkA
AraC, CRP      araBAD, araC
AraC           araE
ArgR           argCBH, argE, argF, argI, argR p1, carAB p2
BioB           bioA
FNR, NarL      narG, nirB
Fur            cir p1, cir p2, fhuA, iucA p1
GalR, CRP      gal p1, gal p2
GlpR, CRP      glpO
IclR           aceBAK
IlvY           ilvC, ilvY
LacI, CRP      lac
LexA           cloDF14, lexA, recA, ssb, sulA, uvrA, uvrB p2, uvrD
LexA, CRP      colE1 p1
MalT, CRP      malE, malK
MalT           malPQ
MetJ           metA p1, metB, metJ, metF
MetR, MetJ     metE
MetR           metR, metH
NR(I), CRP     glnA p1
NR(I)          glnL
OmpR           ompC, ompF
OxyR           aapC, katG, ahpC
TyrR           aroL, aroP, tyrB, tyrP, tyrR

Fig. 1. A distinctive representation of sigma 70 UGIs. This linguistic representation of the range of transcription initiation
of sigma 70 promoters results from the application of the proposals discussed throughout this and the accompanying paper.
Proximal sites have a coordinate c feature indicating the position of their central nucleotide. Remote sites have a distance d
feature indicating the number of center-to-center bases relative to the proximal referential site. For size of sites see Table I.
The referential site is the one that bears the name of the regulator. Co-indexation identifies the specificity of remote sites. A
phasing (ph) feature indicates the number of helix turns that separate homologous sites from the referential site. Phasing is
always in relation to the site of reference for co-indexation or specificity. The # symbol is associated with not precisely defined
properties. For additional information see Fig. 1 of accompanying paper.
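
The phasing (ph) feature described in the caption can be illustrated with a short calculation that converts a center-to-center distance in bases into helix turns. This is a minimal sketch under stated assumptions: the helical repeat of 10.5 bp per turn, the tolerance used to decide whether two sites lie on the same face of the helix, and the function names are all assumptions introduced here, not values or terms taken from the paper.

```python
HELICAL_REPEAT_BP = 10.5  # assumed helical repeat of B-DNA, bp per turn


def helix_turns(distance_bp: float,
                repeat: float = HELICAL_REPEAT_BP) -> float:
    """Center-to-center distance expressed in helix turns."""
    return abs(distance_bp) / repeat


def same_face(distance_bp: float,
              repeat: float = HELICAL_REPEAT_BP,
              tolerance: float = 0.2) -> bool:
    """True when the distance is close to a whole number of turns,
    so that proteins bound at both sites face the same side."""
    turns = helix_turns(distance_bp, repeat)
    return abs(turns - round(turns)) <= tolerance


if __name__ == "__main__":
    for d in (21, 26, -105):
        print(d, round(helix_turns(d), 1), same_face(d))
```

The sign of the distance is dropped because only the separation matters for phasing; the direction relative to the referential site is already carried by the d feature.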

3.9. A complex 'binds X' feature

We have decomposed the property 'binds X', where X is a protein, into the set of properties illustrated for a CRP site in Fig. 1. A particular characteristic that applies to bacterial regulatory proteins can be made explicit by integrating this set of distinctive features into a complex feature.

The grouping of several features into a complex 'binds X' feature can reflect the fact that even though the features can be modified independently, they are all properties associated with a single protein. This contrasts with mechanisms of regulation in eukaryotes, which frequently involve interactions among several proteins, each one containing different domains of specificity that are all required together to activate a promoter. This grouping of features into a single complex can then be considered part of the set of principles of the grammar of sigma 70 promoters, as well as of grammars of other prokaryotic promoters. In this way an important difference among systems of regulation is emphasized.

Thus, a property 'binds X', where X is a protein, can substitute for several properties directly dependent on the protein. Those properties that are not associated with a particular protein are excluded from this grouping, such as the affinity of the binding site and position and phasing information. The co-indexing of remote sites will then refer to homology of the set of properties associated with a particular protein.

With this proposal we finish the analysis of distinctive features of regulatory sites. The application of the principle of precedence, as well as the identification of the distinctive features of Pr, Op and I categories, enables us to transcribe the arrays of sigma 70 UGIs collected in Fig. 1 of the previous paper into their corresponding linguistic description, shown here in Fig. 1. This diagram summarizes the distinctive features that have been identified throughout these two papers.

4. Discussion

The results that Fig. 1 represents have been obtained by studying all the properties that we know of as relevant to the regulatability of sigma 70 UGIs. For instance, detailed parameters of mechanism of reaction have been excluded, since they do not seem to contribute to the identification of classes of substitutable elements. These two papers illustrate that the problem of finding the best linguistic representation, following explicit criteria, can give concrete results. Recall, for instance, the representation of positional information as either an independent category or as a feature of regulatory sites. The use of a small number of criteria in the selection of properties of quite different nature represents by itself a descriptive challenge, as these papers illustrate. The distinctive features obtained are the elements of a theoretical characterization of a particular 'reading system'. When dealing with different 'reading systems', a similar analysis would be required.

Among the similarities between the transition of an acoustic to a phonological description in the study of natural language, and the transition of a description of UGIs to an L1 linguistic description, we have the following: (a) a phonological description uses elements that characterize natural language as a system; (b) it is a transcription that presents 'raw' information in a form suitable for linguistic analysis; (c) the criteria for defining distinctive features are in a way the same as those used here. It is no surprise that these similarities exist, because of the similarity in methods used in the phonological study of natural language (Chomsky and Halle, 1968) and in the study of distinctive features of sigma 70 UGIs.

On the other hand it is also no surprise to see important differences, since the object of each study is different. The dictionary of the grammar associates with each 'word' a list of pertinent selected properties, the sequence of DNA, 'raw information', being one of them. There is nothing equivalent, in a linguistic theory of natural language, to a dictionary that groups acoustic descriptions and their associated phonological properties. An equivalent treatment of molecular data would involve the definition of molecular 'words' using only distinctive features. Following this approach, only two promoter 'words' would be present in the dictionary, that of 'promoters with high affinity' and that of 'promoters with low affinity' (see Section 2). The differences within these classes would be considered 'negligible variants'. However, the incorporation into the dictionary of the different sequences of Prs, Ops and Is makes more sense in the search for a theory of gene regulation with a combinatorial component (Collado-Vides, 1993, in press).

The analysis presented here assumes that there is a finite set of features that define the best representation of a particular 'reading system'. However, since we cannot rule out the discovery in the future of new correlations between relevant properties of such regulatory interpretation of UGIs, the set of features here identified may not be the minimal one. Unknown properties yet to be discovered may satisfy the criteria of distinctive features. In some cases we were able to select distinctive features although their specific values associated are not yet defined. This is a consequence of a formalization of a data-base where each case is not equally well known. Certainly the proposals presented here might be modified as our understanding of gene regulation increases in the future.

The construction of a grammatical theory of gene regulation depends to a considerable extent on the comparison and selection of alternative ways to represent the data, from a single property such as the distance between categories to complete alternative grammatical models. In fact the study of natural language using generative grammars is a discipline of empirical investigation where one could say that our progress in understanding is associated with
novel proposals on representations or with novel relations between different representations. Recall that the theory of linguistic structure is 'essentially, the abstract study of levels of representation' (Chomsky, 1955).

The selection of good representations is important not only within linguistic theory or approaches to artificial intelligence but also in natural sciences. This is illustrated, for instance, by the selection from a large number of physical properties of those that are the most important for determining protein structure and conformation (Kidera et al., 1985).

Grammatical models of gene regulation may represent a useful contribution to the computer scientist when dealing with the processing of large amounts of deciphered information concerning the regulation of gene expression.

The linguistic formalization of gene regulation constitutes a set of ordered results that include the identification of the smallest elements of productive substitutions, the identification of distinctive features, the construction of the component of rewriting rules, and their integration into a grammatical model.

A modification in any of these steps can have important consequences in other components of this integrative linguistic formalization. This reconstructive effort at formalization should be evaluated in terms of its capacity to integrate large amounts of information and in terms of its predictive power (Collado-Vides, 1992).

Acknowledgments

The author acknowledges several fruitful discussions throughout the elaboration of this work with Prof. Boris Magasanik. This work was begun when the author was a postdoctoral fellow in his laboratory in the Department of Biology of the Massachusetts Institute of Technology. More recently the author has been supported by a Guggenheim Fellowship.

References

Amouyal, M., Mortensen, L., Buc, H. and Hammer, K., 1989, Single and double loop formation when DeoR repressor binds to its natural operator sites. Cell 58, 545-551.
Aslanidis, C. and Schmitt, R., 1990, Regulatory elements of the raffinose operon: nucleotide sequences of operator and repressor genes. J. Bacteriol. 172, 2178-2180.
Berg, O.G. and von Hippel, P.H., 1988, Selection of DNA binding sites by regulatory proteins. II: The binding specificity of cyclic AMP receptor protein to recognition sites. J. Mol. Biol. 200, 709-723.
Borowiec, J.A., Zhang, L., Sasse-Dwight, S. and Gralla, J.D., 1987, DNA supercoiling promotes formation of a bent repression loop in lac DNA. J. Mol. Biol. 196, 101-111.
Bourgeois, S. and Jobe, A., 1970, Superrepressors of the lac operon, in: The Lactose Operon, J.R. Beckwith and D. Zipser (eds.) (Cold Spring Harbor Laboratory, New York), pp. 325-342.
Busby, S.J.W., 1986, Positive regulation in gene expression, in: Regulation of Gene Expression: Twenty-Five Years On, I.R. Booth and C.F. Higgins (eds.) (Cambridge University Press, Cambridge), pp. 51-77.
Chye, M.L. and Pittard, J., 1987, Transcription control of the aroP gene in Escherichia coli K-12: role of the upstream operator site. Mol. Microbiol. 2, 377-383.
Chomsky, N., 1955, The Logical Structure of Linguistic Theory (University of Chicago Press, Chicago).
Chomsky, N. and Halle, M., 1968, The Sound Pattern of English (MIT Press, Cambridge, Mass.).
Collado-Vides, J., 1989, A transformational-grammar approach to the study of the regulation of gene expression. J. Theor. Biol. 136, 403-425.
Collado-Vides, J., 1991, A syntactic representation of units of genetic information. J. Theor. Biol. 148, 401-429.
Collado-Vides, J., 1992, Grammatical model of the regulation of gene expression. Proc. Natl. Acad. Sci. USA 89, 9405-9409.
Collado-Vides, J., 1993, The elements for a classification of units of genetic information with a combinatorial component. J. Theor. Biol. (in press).
Collado-Vides, J., Magasanik, B. and Gralla, J.D., 1991, Control site location and transcriptional regulation in Escherichia coli. Microbiol. Rev. 55, 371-394.
Cortay, J.C., Nègre, D., Galinier, A., Duclos, B., Perrière, G. and Cozzone, A.J., 1991, Regulation of the acetate operon in Escherichia coli: purification and functional characterization of the IclR repressor. EMBO J. 10, 675-679.
Cunin, R., Eckhardt, T., Piette, J., Boyen, A., Piérard, A. and Glansdorff, N., 1983, Molecular basis for modulated regulation of gene expression in the arginine regulon of Escherichia coli K-12. Nucl. Acids Res. 11, 5007-5019.
De Crombrugghe, B., Busby, S. and Buc, H., 1982, Activation of transcription by the cyclic AMP receptor protein, in: Biological regulation and development, K. Yamamoto (ed.), Vol. III-B (Plenum Press, New York), pp. 129-167.
de Lorenzo, V., Wee, S., Herrero, M. and Neilands, J.B., 1987, Operator sequences of the aerobactin operon of
plasmid ColV-K30 binding the ferric uptake regulation (fur) repressor. J. Bacteriol. 169, 2624-2630.
Eiglmeir, K.N., Honoré, N., Iuchi, S., Lin, E.C.C. and Cole, S.T., 1989, Molecular genetic analysis of FNR-dependent promoters. Mol. Microbiol. 3, 869-878.
Eyring, H. and Eyring, E., 1963, Modern Chemical Kinetics (Reinhold Publ., New York).
Gaston, K., Bell, A., Kolb, A., Buc, H. and Busby, S., 1990, Stringent spacing requirements for transcription activation by CRP. Cell 62, 733-743.
Gerlach, P., Søgaard-Andersen, L., Pedersen, H., Martinussen, J., Valentin-Hansen, P. and Bremer, E., 1991, The cyclic AMP (cAMP)-cAMP receptor protein complex functions both as an activator and as a corepressor at the tsx-p2 promoter of Escherichia coli K-12. J. Bacteriol. 173, 5419-5430.
Gilbert, W. and Müller-Hill, B., 1966, Isolation of the lac repressor. Proc. Natl. Acad. Sci. USA 56, 1891-1898.
Gralla, J.D., 1989, Specific repression in the lac operon: the 1988 version, in: DNA-Protein Interactions in Transcription, J.D. Gralla (ed.) (Alan R. Liss, New York), pp. 3-10.
Haber, R. and Adhya, S., 1988, Interaction of spatially separated protein-DNA complexes for control of gene-expression: operator conversions. Proc. Natl. Acad. Sci. USA 85, 9683-9687.
Hawley, D.K. and McClure, W.R., 1983, Compilation and analysis of Escherichia coli promoter DNA sequences. Nucl. Acids Res. 11, 2237-2255.
Helmann, J.D. and Chamberlin, M.J., 1988, Structure and function of bacterial sigma factors. Annu. Rev. Biochem. 57, 839-872.
Igarashi, K., Hanamura, A., Makino, K., Aiba, H., Aiba, H., Mizuno, T., Nakata, A. and Ishihama, A., 1991, Functional map of the alpha subunit of Escherichia coli RNA polymerase: two modes of transcription activation by positive factors. Proc. Natl. Acad. Sci. USA 88, 8958-8962.
Jobe, A., Sadler, J.R. and Bourgeois, S., 1974, lac repressor-operator interaction. IX: The binding of lac to operators containing Oc mutations. J. Mol. Biol. 85, 231-248.
Kennedy, M., Chandler, M. and Lane, D., 1988, Mapping and regulation of the pifC promoter of the F plasmid. Biochim. Biophys. Acta 950, 75-80.
Kidera, A., Konishi, Y., Oka, M., Ooi, T. and Scheraga, H.A., 1985, Statistical analysis of the physical properties of the 20 naturally occurring amino acids. J. Prot. Chem. 4 (1), 23-55.
Klig, L.S., Carey, J. and Yanofsky, C., 1988, trp repressor interactions with the trp aroH and trpR operators: comparison of repressor binding in vitro and repression in vivo. J. Mol. Biol. 202, 769-777.
Lanzer, M. and Bujard, H., 1988, Promoters largely determine the efficiency of repressor action. Proc. Natl. Acad. Sci. USA 85, 8973-8977.
Lee, N., Francklyn, C. and Hamilton, E.P., 1987, Arabinose-induced binding of AraC protein to araI2 activates the araBAD operon promoter. Proc. Natl. Acad. Sci. USA 84, 8814-8818.
Lehming, N., Sartorius, J., Kisters-Woike, B., von Wilcken-Bergmann, B. and Müller-Hill, B., 1990, Mutant lac repressors with new specificities hint at rules for protein-DNA recognition. EMBO J. 9, 615-621.
Lehming, N., Sartorius, J., Niemöller, M., Genenger, G., von Wilcken-Bergmann, B. and Müller-Hill, B., 1987, The interaction of the recognition helix of lac repressor with lac operator. EMBO J. 6 (10), 3145-3153.
Lim, D., Oppenheim, J.D., Eckhardt, T. and Maas, W.K., 1987, Nucleotide sequence of the argR gene of Escherichia coli K-12 and isolation of its product, the arginine repressor. Proc. Natl. Acad. Sci. USA 84, 6697-6701.
Little, J.W., Mount, D.W. and Yanisch-Perron, C.R., 1981, Purified lexA protein is a repressor of the recA and lexA genes. Proc. Natl. Acad. Sci. USA 78, 4199-4203.
Lobell, R.B. and Schleif, R.F., 1990, DNA looping and unlooping by araC protein. Science 250, 528-532.
Maeda, S. and Mizuno, T., 1988, Activation of the ompC gene by the OmpR protein in Escherichia coli: the cis-acting upstream sequence can function in both orientations with respect to the canonical promoter. J. Biol. Chem. 263, 14629-14633.
Maeda, S. and Mizuno, T., 1990, Evidence for multiple OmpR-binding sites in the upstream activation sequence of the ompC promoter in Escherichia coli: a single OmpR-binding site is capable of activating the promoter. J. Bacteriol. 172, 501-503.
Majumdar, A. and Adhya, S., 1987, Probing the structure of gal operator-repressor complexes: conformation change in DNA. J. Biol. Chem. 262, 13258-13262.
Makino, K., Shinagawa, H., Amemura, M. and Nakata, A., 1986, Nucleotide sequence of the phoB gene, the positive regulatory gene for the phosphate regulon of Escherichia coli K-12. J. Mol. Biol. 190, 37-44.
McClure, W.R., 1985, Mechanism and control of transcription initiation in prokaryotes. Annu. Rev. Biochem. 54, 171-204.
Menendez, M., Kolb, A. and Buc, H., 1987, A new target for CRP action at the malT promoter. EMBO J. 6, 4227-4234.
Menon, K.P. and Lee, N.L., 1990, Activation of ara operons by a truncated AraC protein does not require inducer. Proc. Natl. Acad. Sci. USA 87, 3708-3712.
Mortensen, L., Dandanell, G. and Hammer, K., 1989, Purification and characterization of the deoR repressor of E. coli. EMBO J. 8, 325-331.
Ostrovski de Spicer, P., O'Brien, K. and Maloy, S., 1991, Regulation of proline utilization in Salmonella typhimurium: a membrane-associated dehydrogenase binds DNA in vitro. J. Bacteriol. 173, 211-219.
Ostrowski, J. and Kredich, N.M., 1991, Negative autoregulation of cysB in Salmonella typhimurium: in vitro interactions of CysB protein with CysB promoter. J. Bacteriol. 173, 2212-2218.
Otsuka, A. and Abelson, J., 1978, The regulatory region of

the biotin operon in Escherichia coli. Nature (London) 276, 689-694.
Phillips, S.E.V., Mansfield, I., Parsons, I., Davidson, B.E., Rafferty, J.B., Somers, W.S., Margarita, D., Cohen, G.N., Saint-Girons, I. and Stockley, P.G., 1989, Cooperative tandem binding of met repressor of Escherichia coli. Nature (London) 341, 711-715.
Ptashne, M., 1986, A Genetic Switch: Gene Control and Phage Lambda (Cell Press, Cambridge, Mass.).
Ptashne, M., 1988, How eukaryotic transcriptional activators work. Nature (London) 335, 683-689.
Raibaud, O., 1989, Nucleoprotein structures at positively regulated bacterial promoters: homology with replication origins and some hypotheses on the quaternary structure of activator proteins in these complexes. Mol. Microbiol. 3, 455-458.
Raibaud, O., Vidal-Ingigliardi, D. and Richet, E., 1989, A complex nucleoprotein structure involved in activation of transcription of two divergent Escherichia coli promoters. J. Mol. Biol. 205, 471-485.
Reitzer, L.J. and Magasanik, B., 1983, Isolation of the nitrogen assimilation regulator NR(I), the product of the glnG gene of Escherichia coli. Proc. Natl. Acad. Sci. USA 80, 5554-5558.
Reitzer, L.J. and Magasanik, B., 1986, Transcription of glnA in E. coli is stimulated by activator bound to sites far from the promoter. Cell 45, 785-792.
Richet, E. and Raibaud, O., 1991, Supercoiling is essential for the formation and stability of the initiation complex at the divergent malEp and malKp promoters. J. Mol. Biol. 218, 529-542.
Richet, E., Vidal-Ingigliardi, D. and Raibaud, O., 1991, A new mechanism for coactivation of transcription initiation: repositioning of an activator triggered by the binding of a second activator. Cell 66, 1185-1195.
Riggs, A., Suzuki, H. and Bourgeois, S., 1970, lac repressor-operator interaction. I: Equilibrium studies. J. Mol. Biol. 48, 67-83.
Rolfes, R.J. and Zalkin, H., 1990, Autoregulation of Escherichia coli purR requires two control sites downstream of the promoter. J. Bacteriol. 172, 5758-5766.
Sakumi, K. and Sekiguchi, M., 1989, Regulation of expression of the ada gene controlling the adaptive response: interactions with the ada promoter of the Ada protein and RNA polymerase. J. Mol. Biol. 205, 373-385.
Simons, A., Tils, D., von Wilcken-Bergmann, B. and Müller-Hill, B., 1984, Possible ideal lac operator: Escherichia coli lac operator-like sequences from eukaryotic genomes lack the central G:C pair. Proc. Natl. Acad. Sci. USA 81, 1624-1628.
Shevell, D.E. and Walker, G.C., 1991, A region of the Ada DNA-repair protein required for the activation of ada transcription is not necessary for activation of alkA. Proc. Natl. Acad. Sci. USA 88, 9001-9005.
Søgaard-Andersen, L., Pedersen, H., Holst, B. and Valentin-Hansen, P., 1991, A novel function of the cAMP-CRP complex in Escherichia coli: cAMP-CRP functions as an adapter for the CytR repressor in the deo operon. Mol. Microbiol. 5, 969-975.
Studnicka, G.M., 1988, Escherichia coli promoter -10 and -35 region homologies correlate with binding and isomerization kinetics. Biochem. J. 252, 825-831.
Takahashi, M., Altschmied, L. and Hillen, W., 1986, Kinetic and equilibrium characterization of the Tet repressor-tetracycline complex by fluorescence measurements. J. Mol. Biol. 187, 341-348.
Tartaglia, L.A., Storz, G. and Ames, B.N., 1989, Identification and molecular analysis of oxyR-regulated promoters important for the bacterial adaptation to oxidative stress. J. Mol. Biol. 210, 709-719.
Telesnitsky, A.P.W. and Chamberlin, M.J., 1989, Sequences linked to prokaryotic promoters can affect the efficiency of downstream termination sites. J. Mol. Biol. 205, 315-330.
Tobin, J.F. and Schleif, R.F., 1990, Purification and properties of RhaR, the positive regulator of the L-rhamnose operons of Escherichia coli. J. Mol. Biol. 211, 75-89.
Tsung, K., Brissette, R.E. and Inouye, M., 1990, Enhancement of RNA polymerase binding to promoters by a transcriptional activator, OmpR, in Escherichia coli: its positive and negative effects on transcription. Proc. Natl. Acad. Sci. USA 87, 5940-5944.
Ueno-Nishio, S., Mango, S., Reitzer, L.J. and Magasanik, B., 1984, Identification and regulation of the glnG operator promoter of the complex glnALG operon of Escherichia coli F. J. Bacteriol. 160, 379-384.
Urbanowski, M.L. and Stauffer, G.V., 1989, Genetic and biochemical analysis of the MetR activator-binding site in the metE metR control region of Salmonella typhimurium. J. Bacteriol. 171, 5620-5629.
Vidal-Ingigliardi, D., Richet, E. and Raibaud, O., 1991, Two MalT binding sites in direct repeat. J. Mol. Biol. 218, 323-334.
Walker, G.C., 1987, The SOS response in Escherichia coli, in: Escherichia coli and Salmonella typhimurium: Cellular and Molecular Biology, F.C. Neidhardt, J.L. Ingraham, L.K. Brooks, B. Magasanik, M. Schaechter and H.E. Umbarger (eds.), Vol. 2 (American Society for Microbiology), pp. 1346-1357.
Wang, Q. and Kaguni, J.M., 1989, dnaA protein regulates transcription of the rpoH gene of Escherichia coli. J. Biol. Chem. 264, 7338-7344.
Wek, R.C. and Hatfield, G.W., 1988, Transcriptional activation at adjacent operators in the divergent-overlapping ilvY and ilvC promoters of Escherichia coli. J. Mol. Biol. 203, 643-663.
Wilcox, G. and Meuris, P., 1976, Stabilization and size of AraC protein. Mol. Gen. Genet. 145, 97-100.
Ye, S. and Larson, T.J., 1988, Structures of the promoter and operator of the glpD gene encoding aerobic sn-glycerol-3-phosphate dehydrogenase of Escherichia coli K-12. J. Bacteriol. 170, 4209-4215.
