
Text and Speech Corpus for Indonesian LVCSR

1. Background
From August 2005 to April 2006, a joint research team conducted research on the development of a text and speech corpus for Indonesian Large Vocabulary Continuous Speech Recognition (LVCSR). The team consisted of TELKOM R&D Center (TELKOM RisTI), Indonesia, as the leader; Telkom School of Engineering (STT Telkom), Indonesia; and Advanced Telecommunication Research (ATR), Japan. The project was funded by the 2005 round of the APT HRD Program for Exchange of ICT Researchers and Engineers. The results of the project are: a text source of 500K news domain sentences; a lexicon dictionary of 37.5K words; two sentence sets, consisting of 2,500 application domain sentences and 3,186 tri-phone balanced news domain sentences; and 84K spoken sentences (utterances) for clean and telephony speech. As the project leader, TELKOM R&D Center (TELKOM RisTI), Indonesia, owns all of the results.

2. Text Corpus
The text corpus covers two domains: 1) the application domain (five existing, running applications in TELKOM RisTI: Directory Services for Hearing and Speaking impaired telecommunication service, Tele-home security, Billing information Services, Reservation services, and the Status tracking feature of e-Gov services), which consists of 2,500 sentences; and 2) the news domain (one year, 2001, of headline news from two widely read Indonesian daily newspapers, KOMPAS and TEMPO), which consists of 500K sentences. Each sentence in both domains contains at most 10 words, for readability. A dictionary was developed from the text sources of both domains; it covers 37.5K words (around 30.5K native Indonesian words and 7K person/place names), and its lexicon was developed by an Indonesian language expert.

The tri-phone balanced sentence set (3,186 sentences) was extracted from the 500K news domain text source. This set covers 9,668 distinct tri-phones. Afterward, the sentence sets from both domains were combined and distributed into 100 sentence lists. The 2,500 application domain sentences were divided into 100 sets of 100 sentences each, with an overlap ratio of 75%, and the 3,186 news domain sentences were divided into 100 sets of 110 sentences each, with an overlap ratio of 70%. Thus, each sentence list contains 210 sentences (100 application domain sentences and 110 news domain sentences). These sentence lists are read by the 400 speakers.
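The overlap ratios quoted for the sentence lists can be cross-checked with a short calculation. The sketch below assumes that "overlap ratio" means the share of list slots filled by a repeated sentence, that is, 1 minus (unique sentences / total slots over all lists). The text does not define the term, so this definition and the function name `overlap_ratio` are illustrative assumptions only.

```python
def overlap_ratio(unique_sentences: int, num_lists: int, per_list: int) -> float:
    """Share of list slots occupied by repeated sentences.

    Assumed definition (not stated in the text):
    1 - unique_sentences / (num_lists * per_list).
    """
    total_slots = num_lists * per_list
    return 1.0 - unique_sentences / total_slots


# Application domain: 2,500 unique sentences spread over 100 lists of 100 sentences.
app_overlap = overlap_ratio(2500, 100, 100)

# News domain: 3,186 unique sentences spread over 100 lists of 110 sentences.
news_overlap = overlap_ratio(3186, 100, 110)

# Each combined sentence list holds one application set and one news set.
sentences_per_list = 100 + 110

print(f"application domain overlap: {app_overlap:.1%}")
print(f"news domain overlap: {news_overlap:.1%}")
print(f"sentences per list: {sentences_per_list}")
```

Under this assumed definition the application domain comes out at exactly 75% and the news domain at roughly 71%, which is in line with the rounded 70% figure given above.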

3. Speech Corpus
3.1 Speakers
There are 400 speakers, distributed by gender (201 males and 199 females), age (20% aged 18-23, 40% aged 24-35, 30% aged 36-50, and 10% aged 51-60), and four major western Indonesian accents (16.75% Batak, 28.5% Jakarta, 21% Javanese, and 33.75% Sundanese).
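As a sanity check on the speaker distribution, the percentages can be converted into head counts. The following is a minimal sketch; the dictionary names and the helper `head_counts` are hypothetical, and the only inputs are the 400 speakers and the percentages for age (20/40/30/10) and accent (16.75/28.5/21/33.75).

```python
TOTAL_SPEAKERS = 400

# Percentage shares by age group and by accent.
age_share = {"18-23": 20.0, "24-35": 40.0, "36-50": 30.0, "51-60": 10.0}
accent_share = {"Batak": 16.75, "Jakarta": 28.5, "Javanese": 21.0, "Sundanese": 33.75}


def head_counts(shares: dict, total: int) -> dict:
    """Convert percentage shares into speaker counts (hypothetical helper)."""
    return {group: round(total * pct / 100) for group, pct in shares.items()}


age_counts = head_counts(age_share, TOTAL_SPEAKERS)
accent_counts = head_counts(accent_share, TOTAL_SPEAKERS)

for name, counts in (("age", age_counts), ("accent", accent_counts)):
    print(name, counts, "total:", sum(counts.values()))
```

Both breakdowns resolve to whole numbers of speakers and each sums to 400, so the percentages are mutually consistent.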

3.2 Soundproof room


The specifications of the soundproof room and recording equipment are as follows:

1. Soundproof parameter:
   a) Sound insulation level: 50 dB
   b) Background noise level: 22 dB
   c) Reverberation time: 0.15 second
2. Soundproof design:
   a) Length: 280 cm
   b) Width: 220 cm
   c) Height: 260 cm
   d) Thickness: 26.5 to 52.5 cm
3. Recording equipment:
   The recording equipment was configured to enable the recording of both clean speech (microphone source) and telephone speech. By following strict ATR requirements, it is expected that noise, mainly generated by electricity, will be reduced as far as possible so that the recordings are low-noise.

3.3 Speech Corpus Size


Each speaker uttered 210 sentences. Thus, there are 400 x 210 = 84,000 utterances. Using a mono channel, 48 kHz sampling frequency, 16 bits quantization level, and WAV file format, the corpus size is around 26 Giga Bytes with a duration of around 75 hours.
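The relationship between recording format, corpus size, and duration can be sketched as follows. This is only a rough estimate: it assumes uncompressed PCM audio, ignores WAV header overhead, and reads "Giga Bytes" as 10^9 bytes. None of these details is stated in the text, and all variable names are illustrative.

```python
SPEAKERS = 400
SENTENCES_PER_SPEAKER = 210

SAMPLE_RATE_HZ = 48_000   # 48 kHz sampling frequency
BYTES_PER_SAMPLE = 2      # 16-bit quantization
CHANNELS = 1              # mono

CORPUS_BYTES = 26e9       # "around 26 Giga Bytes", assumed to mean 26 x 10^9 bytes

utterances = SPEAKERS * SENTENCES_PER_SPEAKER

# Uncompressed PCM data rate; WAV header bytes are ignored (assumption).
bytes_per_second = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS

duration_hours = CORPUS_BYTES / bytes_per_second / 3600
avg_utterance_seconds = CORPUS_BYTES / bytes_per_second / utterances

print(f"utterances: {utterances}")
print(f"estimated duration: {duration_hours:.1f} hours")
print(f"average utterance length: {avg_utterance_seconds:.2f} seconds")
```

Under these assumptions the estimate comes to a little over 75 hours, or roughly three seconds per utterance, which is plausible for sentences of at most 10 words.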
