You are on page 1of 21

Practical Application:

Web Programming and Bioinformatics

1
14. Web Programming and Bioinformatics
In Section 3.2, we learned how the syntax of Pascal programming language can be defined with a CFG. (Actually,
Appendix A shows the whole definition of Pascal programming language in terms of a syntax flow graph.) Other
programming languages can also be defined formally with a grammar, except for a few exceptional cases of context
dependency, such as double declaration of a variable in a block and using numeric labels in DO-loop in FORTRAN.
This chapter first shows that the major parts of the popular Web programming language HTML (Hypertext Markup
Language) and XML (Extensible Markup Language), which is designed for e-commerce, can also be defined in terms of
a CFG. Then the chapter presents a potential application of the formal languages to molecular biology.

14.1 Hyper Text Markup Language (HTML) 415


14.2 Document Type Definition (DTD) and XML 418
14.3 Genetic code and grammar 425

2
Web Programming and Bioinformatics
14.1 Hyper Text Markup Language (HTML)
In this section, to see how our knowledge in the formal languages could be applicable
to investigating the properties of programming languages, we shall first examine
HTML. The left box below shows an informal definition (itemized for convenience) of
an HTML list that would commonly appear in a text. To the right are CFG rules
translated from the informal definition.

(1) Char denotes a character. (1) Char → a | A | . . . . . | z | Z


(2) Text is a sequence of |..
characters. (2) Text → Char Text | ε
(3) Doc is a sequence of
(3) Doc → Element Doc | ε
Elements.
(4) For a string x, we call <x> a (4) Tag → <Text> | </Text>
tag, and </x> the matching tag of (5) Element → Text |
<x>. a Doc in-between a matching tag
(5) Element is a Text, or a Doc in-
| <Text>Doc
between a matching tag, or a Doc
with a tag at the front. (6) ListItem → <LI>Doc
(6) ListItem is a document with (7) List → ListItem List | ε
tag <LI> at the front. ListItem is an
item of a list. 3
(7) List is a sequence of zero or
HTML Web Programming and Bioinformatics
Notice that in the CFG rules, the bold-faced Italic words denote nonterminal
symbols. The blank between two nonterminals (e.g., Char Text) is used to delimit the
two nonterminals (not for the terminal symbol blank) and hence, can be ignored. While
translating, we found a context dependent part of the language, i.e., “Element is a Doc
between a matching tag,” that is impossible to translated into a context-free grammar
rule (see the highlighted part).

(1) Char denotes a character.


(1) Char → a | A | . . . . . | z | Z | . .
(2) Text is a sequence of
characters. (2) Text → CharText | ε
(3) Doc is a sequence of (3) Doc → ElementDoc | ε
Elements. (4) Tag → <Text> | </Text>
(4) For a string x, we call <x> a
tag, and </x> the matching tag of (5) Element → Text |
<x>. a Doc between a matching tag
(5) Element is a Text, or a Doc in- | TextDoc
between a matching tag, or a Doc (6) ListItem → <LI>Doc
with a tag at the front.
(7) List → ListItem List | ε
(6) ListItem is a document with
tag <LI> at the front. ListItem is an
item of a list. 4
(7) List is a sequence of zero or
Web Programming and Bioinformatics
HTML
According to the informal definition, a document Doc should appear between a
matching tag, like <x>Doc</x> , where x is a text (i.e., a string). In the formal
definition, we need a rule with Element which can generate <x>Doc</x> for an
arbitrary Text x. Notice that Element → <Text>Doc</Text> does not work, because
we cannot guarantee the left side Text and the right side Text generate identical
strings.
Actually, it is not hard to see that the set of documents with matching tags has the
same context dependency as in the language L = {xcx | x ∈ {a, b}*}. Using the
pumping lemma for context-free languages, we can prove that this language L is not
context-free (see Exercise 12.4). This implies that the list part of HTML is not
context-free. However, if we restrict the matching tags to a finite set of strings, say
{w1, w2, . . , wk} (e.g., <EM>Doc</EM>, <OL>Doc</OL>, etc.), HTML list can be
defined in terms of a “pure” CFG as shown below.

(1) Char → a | A | . . . . . | z | Z | . . . (2) Text → Char Text | ε


(3) Doc → Element Doc | ε (4) Tag → <Text> | </Text>
(5) Element → Text | <Text>Doc | <w1>Doc</w1> | . . . . | <wk>Doc</wk>
(6) ListItem → <“LI”>Doc (7) List → ListItem List | ε

5
Web Programming and Bioinformatics
14.2 DTD and XML
The major objective of HTML is to define the format of a document, and hence it is
not easy to describe the semantics of the text, that is needed in the area of e-
commerce. XML has been introduced to overcome this drawback. Actually, XML is
the language (i.e., the set of documents) written according to a DTD (Document Type
Definition), which can be transformed into a CFG. All DTD’s have the following
form.
<!DOCTYPE name-of-DTD [ List-of-Element-Definitions ]>,
where each element definition has the following format.
<!ELEMENT element-name (Element-Description)>

This format can be transformed into the following CFG rules, where DTD corresponds
to the start symbol S.
DTD → <!DOCTYPE name-of-DTD [ List-of-Element-Definitions ]>
List-of-Element-Definitions → Element-Definition List-of-Element-Definitions
| Element-Definition
Element-Definition → <!ELEMENT element-name (Element-Description)>

6
DTD and XML Web Programming and Bioinformatics

DTD → <!DOCTYPE name-of-DTD [ List-of-Element-Definitions ]>


List-of-Element-Definitions → Element-Definition List-of-Element-Definitions
| Element-Definition
Element-Definition → <!ELEMENT element-name (Element-Description)>

The List-of-Element-Definitions and Element-Description are, respectively,


delimited by the pairs of the matching tags <name-of-DTD> and </name-of-DTD>,
and <element-name> and </element-name>. Element description is a variation of the
regular expression that is defined as follows.
(1) Both (a) and (b) below are element descriptions.
(a) An element-name.
(b) Any text (with no tag) that is denoted by the special term #PCDATA.
(2) If E1 and E2 are element descriptions, E1*, E1+, E1?, (E1, E2), and E1 | E2
are element descriptions.
Recall that E1+ = E1(E1)*, E1? = E1+ε , E1 | E2 = E1+E2, and E1,E2 correspond
to the concatenation of regular expression E1 and E2. We know that every language
expressible with regular expression can also be defined in terms of a regular grammar,
and every regular grammar is a CFG. Hence, Element-Description is context-free.
7
Web Programming and Bioinformatics
DTD and XML
Here is a popular example of a DTD, which defines the specification of PC’s.

<!DOCTYPE PcSpecs [
<!ELEMENT PCS (PC*)>
<!ELEMENT PC (MODEL, PRICE, PROCESSOR, RAM, DISK+)>
<!ELEMENT MODEL (\#PCDATA)>
<!ELEMENT PRICE (\#PCDATA)>
<!ELEMENT PROCESSOR (MANF, MODEL, SPEED)>
<!ELEMENT MANF (\#PCDATA)>
<!ELEMENT MODEL (\#PCDATA)>
<!ELEMENT SPEED (\#PCDATA)>
<!ELEMENT RAM (\#PCDATA)>
<!ELEMENT DISK (HARDDISK | CD | DVD) >
<!ELEMENT HARDDISK (MANF, MODEL, SIZE)>
<!ELEMENT SIZE (\#PCDATA)>
<!ELEMENT CD (SPEED)>
<!ELEMENT DVD (SPEED)>
]>

8
Web Programming and Bioinformatics
DTD and XML

Here, we show how to transform each element definition of PcSpecs into CFG
rules (shown in the box).

<!ELEMENT PCS (PC*)>


PCS → <PCS>PCSS </PCS> PCSS → PC PCSS | ε

<!ELEMENT PC (MODEL, PRICE, PROCESSOR, RAM, DISK+)>

PC → <PC>MODEL PRICE PROCESSOR RAM DISK+</PC>


DISK+ → <DISK+>DISKS</DISK+> DISKS → DISK DISKS | DISK

<!ELEMENT MODEL (\#PCDATA)> MODEL → <MODEL>Text</MODEL>

<!ELEMENT PRICE (\#PCDATA)> PRICE → <PRICE>Text</PRICE>

9
Web Programming and Bioinformatics
DTD and XML

<!ELEMENT PROCESSOR (MANF, MODEL, SPEED)>

PROCESSOR → <PROCESSOR>MANF MODEL SPEED</PROCESSOR>

<!ELEMENT MANF (\#PCDATA)> MANF → <MANF>Text</MANF>

<!ELEMENT MODEL (\#PCDATA)> MODEL → <MODEL>Text</MODEL>

.....

<!ELEMENT DISK (HARDDISK | CD | DVD) >

DISK → <DISK>HARDDISK</DISK> |
<DISK>CD</DISK> | <DISK>DVD</DISK>

.....
10
Web Programming and Bioinformatics
DTD and XML

Here is an example of XML document written according to the DTD PcSpecs.

<PCS>
<PC>
<MODEL>1234</MODEL>
<PRICE>$3000</PRICE>
<PROCESSOR>
<RAM>512</RAM>
<DISK><HARDDISK>
<MANF>Superdc</MANF>
<MODEL>xx1000</MODEL>
<SIZE>62Gb</SIZE>
</HARDDISK></DISK>
<DISK><CD>
<SPEED>32x</SPEED>
</CD></DISK>
</PC>
<PC> . . . . </PC>
</PCS>
11
Web Programming and Bioinformatics
DTD and XML

Again in XML we see the same context dependency caused by the matching tags.
As in the previous example, if we restrict the matching tags to those chosen from a
finite set of reserved words, such as MODES, PRICE, DISK, etc., XML turns out to
be a CFL.

Warning Signs
Break Time
- On children's alphabet blocks: Letters may be used to construct words, phrases and sentences that may be
deemed offensive.
- On a microscope: Objects are smaller and less alarming than they appear.
- On a revolving door: Passenger compartments for individual use only.
- On work gloves: For best results, do not leave at crime scene.
- On a palm sander: Not to be used to sand palms.
- On a calendar: Use of term "Sunday" for reference only. No meteorological warranties express or implied.
-On a blender: Not for use as an aquarium.
- Dennis -

12
Web Programming and Bioinformatics
14.3 Gene code and grammar
Quite a few linguistic terminologies, such as code, express, translate, edit, etc., have
been adopted by biologists, especially in the field of molecular biology, ever since
James Watson and Francis Crick proposed the structure of DNA (deoxyribonucleic
acid) strands in 1953. This implies that in biology there are problems that we can
approach with the knowledge of automata and formal languages. As an example, this
section shows how it is possible to express genes in terms of a grammar.
Before the example, we need a very brief introduction to the process of protein
synthesis. A genome is a long double helix DNA strand, consisting of four chemical
bases; adenine, thymine, cytosine and guanine, denoted by A, T, C, and G, respectively.

Genome (double helix)

DNA molecule backbone


A G T C
bases
T C A G

13
Gene Grammar Web Programming and Bioinformatics

Each base in a double helix forms one of the four complementary pairs, A-T, T-A,
C-G, and G-C as the following example shows. (Thus, given a DNA single strand,
we can figure out its complementary strand.)
ACTTAACGGCGAT
TGAATTGCCGCTA
Proteins are synthesized by the following three steps; (1) a single strand segment
(i.e., gene) of the genome is transcribed into an RNA. The RNA is the complementary
strand of the source strand, with every base T replaced by U (uracil). Hence, RNA’s
consist of four bases, A, C, G and U.

5’ end TCAAGCT
5’ end
UCAAGCU
3’ end
3’ end AGTTCGA

RNA
(1) RNA Transcription

14
Gene Grammar Web Programming and Bioinformatics

(2) In eukaryotes, only useful substrings for protein synthesis, called exons, are
spliced (i.e., cut off and concatenated) to form a mRNA (messenger RNA). (3)
Finally, ribosomes, large molecular assemblies traveling along the mRNA, translate
the mRNA into a protein, which is an amino-acid sequence.

exons introns
Transcribed RNA

(2) splicing
Messenger RNA

(3) Translation
Protein

15
Gene Grammar Web Programming and Bioinformatics

There are 20 different amino acids. Hence, proteins can be represented as a string of
20 characters. The translation is done by triplets (i.e., groups of three adjacent bases),
also called codons, according to the table shown below. The translation begins with the
start codon ATG, and continues until it meets one of the stop codons, TAA, TAG, and
TGA (see the table below). For every codon scanned, a specific amino acid is attached
to the next position of the protein being synthesized. Notice that because there are
4× 4× 4 = 64 different codons and 20 amino acids, several codons are translated into the
same amino acid. (Following convention, we shall represent codons using character T
rather than U.)

Ala Arg Asp Ans Cys Glu Gln Gly His Ile Leu Lys Met Phe Pro Ser Thr Trp Tyr Var Stop
GCA AGA GAT AAT TGT GAA CAA GGA CAT ATA TTA AAA ATG TTT CCA AGT ACA TGG TAT GTA TAA
AAC TGC GAG CAG GGG CAC ATT TTG AAG (start) TTC CCG AGC ACG TAC GTG TAG
GCG AGG GAC
GGT ATC CTA CCT TCA ACT GTT TGA
GCT CGA GGC CTG CCC TCG ACC GTC
CTT TCT
GCC CGG
CTC TCC
CGT
CGC

16
Gene Grammar Web Programming and Bioinformatics

This three step process for protein synthesis is called the central dogma in
molecular biology, because with so many exceptional cases, it is no longer accepted as
a valid theory. However, to show a potential application of the formal languages and
automata theory to molecular biology, we shall present an example based on this
dogma. As we did in Section 14.1 for HTML, we will transform the informal definition
of genes into a formal definition in terms of a grammar. (This example is an excerpt,
lightly edited, from David Sears’ paper “The Linguistics of DNA,” which appeared in
the journal American Scientist, Vol. 80, 1992.)
• The transcript, which is the part of the gene that is copied into a messenger RNA,
has flanking 5’- and 3’-untranslated-regions around a coding-region initiated by a
start-codon.

<gene> → <transcript>
<transcript> → <5’-untranslated-region><start-codon><coding-region>
<3’-untranslated-region>

17
Gene Grammar Web Programming and Bioinformatics

• The coding-region consists of a codon followed by more coding-region, a recursion


that ultimately terminates in a stop-codon. The coding-region also allows for a splice at
any point in the recursion, resulting in the insertion of an intron, which is bracketed by a
splicing signals GT and AG.

<coding-region> → <codon><coding-region> | <splice><coding-region>


| <stop-codon>
<splice> → <intron> <intron> → GT<intron-body>AG

• A stop-codon is any of the three triplets TAA, TAG and TGA, whereas a codon is
any of the remaining 61 possible triplets that are translated into an amino acid
according to the rule given in the codon table.

<start-codon> → <Met> <stop-codon> → TAA | TAG | TGA

<codon> → <Ala> | <Arg> | <Asp> | <Asn>| <Cys>| <Glu>| <Gln> |


<Gly> | <His> | <Ile> | <Leu>| <Lys> | <Met>| <Phe>|
<Pro> | <Ser> | <Thr> | <Typ> | <Tyr> | <Val>

18
Gene Grammar Web Programming and Bioinformatics
• The first two bases of a codon partially specify the amino acid into which it is
translated. Every codon starting with GC (CG) is translated into alanine (respectively,
arginine), independent of the third base. Every codon starting with GA is translated into
aspartic acid if the third base is a pyrimidine (i.e., C or T). Otherwise, if the third base
is a purine (i.e., A or G), it is translated into glutamic acid. Here are some formal
representations of such translation rules specified in the codon table.

<Ala> → GC<base> <Arg> → CG<base> <Asp> → GA<pyrimidine>


<Cys> → TG<pyrimidine> <Glu> → GA<purine> <Gln> → CA<purine>
<base> → A | C | G | T <pyrimidine> → C | T <purine> → A | G

All the grammar rules presented up to this point are context-free. But the following
biological fact requires a rule from the upper level in the Chomsky hierarchy.
• Introns can be transposed to the left or right.

<base><intron> → <intron><base> <intron><base> → <base><intron>

The following slide shows a grammar that specifies the set of “genes” that we have
developed in this section.
19
Gene Grammar Web Programming and Bioinformatics

<gene> → <transcript>
<transcript> → <5’-untranslated-region><start-codon><coding-region><3’-untranslated-region>
<coding-region> → <codon><coding-region> | <splice><coding-region> | <stop-codon>
---------------------------------------------------------------------------
<codon> → <Ala> | <Arg> | <Asp> | <Asn>| <Cys>| <Glu>| <Gln> | <Gly> | <His> | <Ile>|
<Leu>| <Lys>| <Met>| <Phe>| <Pro>| <Ser>| <Thr>| <Typ>| <Tyr>| <Val>
<start-codon> → <Met> <stop-codon> → TAA | TAG | TGA
-----------------------------------------------------------------------------
<Ala> → GC<base> <Arg> → CG<base> <Asp> → GA<pyrimidine> <Cys> → TG<pyrimidine>
<Glu> → GA<purine> <Gln> → CA<purine> <Gly> → GG<base> <His> → CA<Pyrimidine>
<Ile> → AT<pyrimidine> | ATA <Leu> → TT<purine> | CT<base> <Lys> → AA<purine>
<Met> → ATG <Phe> → TT<primidine> <Pro> → CC<base> <Thr> → AC<base>
<Ser> → AG<primidine> | TC<base> <Typ> → TGG <Tyr> → TA<pyrimidine> <Val> → GT<base>
-------------------------------------------------------------------------------
<base> → A | T | G | C <pyrimidine> → C | T <purine> → A | G
-------------------------------------------------------------------------------
<splice> → <intron> <intron> → GT<inton-body>AG
-------------------------------------------------------------------------------
<intron><base> → <base><intron> <base><intron> → <intron><base>

20
Gene Grammar Web Programming and Bioinformatics

The grammar that we have just developed is incomplete because no rules are defined
for the nonterminal symbols <5’-untranslated-region>, <3’-untranslated-region> and
<intron-body>. The two untranslated regions flanking the coding-region usually
contain segments regulating the gene expression. The contents and the function of
these parts are not yet completely understood. The intron-body parts generally consist
of a long repeating string, whose exact role is also under investigation. We need more
research before we can present a formal model for these regions.

Love Break Time

What else is love but understanding and rejoicing in the fact that another person lives, acts, and experiences
otherwise than we do….? - Friedrich Nietzsche –

Just because you love someone doesn’t mean you have to be involved with them. Love is not a bandage to cover
wounds. - Hugh Elliot -

21

You might also like