
DD1418/DD2418 Language Engineering

2017-11-06

Assignment 2
Readings: Read chapter 4 in Jurafsky-Martin.

Code: The skeleton code can be downloaded from Canvas or from


http://www.csc.kth.se/~jboye/teaching/language_engineering/a02/LanguageModels.zip
Unzip the code in your home directory. Go to the folder LanguageModels and type:
pip install -r requirements.txt
Now everything needed for the assignment should be installed.

Problems:

1. We want a program that computes all bigram probabilities from a given (training) corpus
and stores them in a file. For instance, from the file data/small.txt:

I live in Boston.
I like ants.
Ants like honey.
Therefore I like honey too!

we want to produce the contents of the file small_model_correct.txt. Note that:

• The first line contains two numbers, separated by a space: the vocabulary size V (= the
  number of unique tokens, including punctuation), and the size of the corpus N (= the
  total number of tokens).
• Then follow V lines, each containing three items: an identifier (0, 1, ...), a token, and
  the number of times that token appears in the corpus.
• Then follows a number of lines, one for each non-zero bigram probability. Each line
  contains three numbers: the identifiers of the first and second token of the bigram,
  respectively, followed by the logarithm of the bigram probability, printed with 15
  decimals. The natural logarithm is used (as computed by the math.log library function).
• The final line is -1, to mark end-of-file.
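
As a rough illustration of this format, here is a minimal sketch of code that could emit it
(the data structures index, unigram_count, bigram_count and total_tokens are assumptions
made for the sake of the example, not the skeleton's actual names; the counting itself is
sketched after the next paragraph):

    import math

    # Assumed (hypothetical) data structures:
    #   index[token]         -> integer identifier (0, 1, ...)
    #   unigram_count[token] -> number of times the token occurs
    #   bigram_count[t1][t2] -> number of times t2 immediately follows t1
    def format_model(index, unigram_count, bigram_count, total_tokens):
        lines = []
        # First line: vocabulary size V and corpus size N, separated by a space.
        lines.append(f"{len(index)} {total_tokens}")
        # V lines: identifier, token, unigram count.
        for token, ident in sorted(index.items(), key=lambda kv: kv[1]):
            lines.append(f"{ident} {token} {unigram_count[token]}")
        # One line per non-zero bigram: the two identifiers, then the natural
        # logarithm of P(t2|t1) = count(t1,t2)/count(t1), with 15 decimals.
        for t1, followers in bigram_count.items():
            for t2, count in followers.items():
                log_prob = math.log(count / unigram_count[t1])
                lines.append(f"{index[t1]} {index[t2]} {log_prob:.15f}")
        # Final line: -1 marks end-of-file.
        lines.append("-1")
        return "\n".join(lines)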

The BigramTrainer.py program contains a skeleton for reading a corpus, computing
unigram counts and bigram probabilities, and printing the model. Your task is to extend
the code so that the program works correctly (look for the comments YOUR CODE HERE
in the program). Use the scripts run_trainer_small.sh and run_trainer_kafka.sh to run
the program on test examples.
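One possible shape for the counting step is sketched below (a sketch only, assuming one
token is processed at a time; the class and attribute names are made up for illustration
and do not necessarily match the skeleton):

    from collections import defaultdict

    class BigramCounter:
        # Minimal sketch of the counting logic (not the skeleton's actual class).
        def __init__(self):
            self.index = {}                        # token -> identifier
            self.unigram_count = defaultdict(int)  # token -> occurrence count
            self.bigram_count = defaultdict(lambda: defaultdict(int))
            self.last_token = None                 # previously seen token
            self.total_tokens = 0                  # corpus size N

        def process_token(self, token):
            # Assign a fresh identifier the first time a token is seen.
            if token not in self.index:
                self.index[token] = len(self.index)
            self.unigram_count[token] += 1
            self.total_tokens += 1
            # Count the bigram (previous token, current token).
            if self.last_token is not None:
                self.bigram_count[self.last_token][token] += 1
            self.last_token = token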
You can use the -d option to save the model to a file, e.g.:

python BigramTrainer.py -f data/kafka.txt -d kafka_model.txt

If you are using Windows, printing the model to the terminal will likely lead to character
encoding errors.
By adding the --check flag, you can verify that your results are correct.

python BigramTrainer.py -f data/kafka.txt --check

2. The BigramTester.py program contains a skeleton program for reading a model in the
format described in the previous problem, reading a test corpus, and computing the entropy
of the test corpus given the model (the cross-entropy of the training set and the test set).

(a) Extend the code so that the program works correctly (look for the comments YOUR
CODE HERE in the program). The entropy of the test set is computed as the negated
average log-probability:

$$-\frac{1}{N} \sum_{i=1}^{N} \log P(w_{i-1}\, w_i)$$
where N is the number of tokens in the test corpus. To be able to handle missing words
and missing bigrams, use linear interpolation:

$$P(w_{i-1}\, w_i) = \lambda_1 P(w_i \mid w_{i-1}) + \lambda_2 P(w_i) + \lambda_3$$

The values for the constants $\lambda_1$, $\lambda_2$ and $\lambda_3$ are given in the code for the BigramTester
program; a sketch of the interpolated entropy computation is given after this problem.
The script run_tester_small_kafka.sh tests the model built from small.txt
using kafka.txt as a test corpus, and the script run_tester_kafka_small.sh tests the
model built from kafka.txt on the test corpus small.txt. Compare with my numbers
by using the --check flag. (Your numbers might deviate slightly from mine, for
instance if you are using a different logarithm. I used the natural logarithm.)
(b) Build a model from the file data/guardian_training.txt and another model from
data/austen_training.txt. Compute the entropy of the test file guardian_test.txt
and the test file austen_test.txt, using both models. Report your numbers and your
conclusions from these experiments!
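
As promised above, here is a minimal sketch of the interpolated entropy computation (the
dictionary-based probability lookups p_bigram/p_unigram, the treatment of the first token,
and all names are assumptions for illustration, not the skeleton's actual interface):

    import math

    def entropy(test_tokens, p_bigram, p_unigram, lambda1, lambda2, lambda3):
        # Cross-entropy: the negated average interpolated log-probability.
        # p_bigram[(w1, w2)] and p_unigram[w] are assumed to hold plain
        # (non-log) probabilities; .get(..., 0.0) handles unseen events.
        logprob_sum = 0.0
        for i in range(1, len(test_tokens)):
            w1, w2 = test_tokens[i - 1], test_tokens[i]
            p = (lambda1 * p_bigram.get((w1, w2), 0.0)
                 + lambda2 * p_unigram.get(w2, 0.0)
                 + lambda3)          # lambda3 > 0 keeps the log well-defined
            logprob_sum += math.log(p)
        return -logprob_sum / len(test_tokens)

Note that the constant $\lambda_3$ guarantees a non-zero probability even when both the bigram
and the unigram are missing from the model.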
