Professional Documents
Culture Documents
Language Understanding
Jacob Devlin
Ming-Wei Chang Kenton Lee Kristina Toutanova
Google AI Language
{jacobdevlin,mingweichang,kentonl,kristout}@google.com
Figure 1: Differences in pre-training model architectures. BERT uses a bidirectional Transformer. OpenAI GPT
uses a left-to-right Transformer. ELMo uses the concatenation of independently trained left-to-right and right-
to-left LSTM to generate features for downstream tasks. Among three, only BERT representations are jointly
conditioned on both left and right context in all layers.
models pre-trained on ImageNet (Deng et al., • BERTLARGE : L=24, H=1024, A=16, Total
2009; Yosinski et al., 2014). Parameters=340M
Token
E[CLS] Emy Edog Eis Ecute E[SEP] Ehe Elikes Eplay E##ing E[SEP]
Embeddings
Segment
Embeddings EA EA EA EA EA EA EB EB EB EB EB
Position
Embeddings E0 E1 E2 E3 E4 E5 E6 E7 E8 E9 E10
Figure 2: BERT input representation. The input embeddings is the sum of the token embeddings, the segmentation
embeddings and the position embeddings.
• The first token of every sequence is al- refer to this procedure as a “masked LM” (MLM),
ways the special classification embedding although it is often referred to as a Cloze task in
([CLS]). The final hidden state (i.e., out- the literature (Taylor, 1953). In this case, the fi-
put of Transformer) corresponding to this to- nal hidden vectors corresponding to the mask to-
ken is used as the aggregate sequence rep- kens are fed into an output softmax over the vo-
resentation for classification tasks. For non- cabulary, as in a standard LM. In all of our exper-
classification tasks, this vector is ignored. iments, we mask 15% of all WordPiece tokens in
each sequence at random. In contrast to denoising
• Sentence pairs are packed together into a sin- auto-encoders (Vincent et al., 2008), we only pre-
gle sequence. We differentiate the sentences dict the masked words rather than reconstructing
in two ways. First, we separate them with the entire input.
a special token ([SEP]). Second, we add a Although this does allow us to obtain a bidirec-
learned sentence A embedding to every token tional pre-trained model, there are two downsides
of the first sentence and a sentence B embed- to such an approach. The first is that we are cre-
ding to every token of the second sentence. ating a mismatch between pre-training and fine-
• For single-sentence inputs we only use the tuning, since the [MASK] token is never seen dur-
sentence A embeddings. ing fine-tuning. To mitigate this, we do not always
replace “masked” words with the actual [MASK]
3.3 Pre-training Tasks token. Instead, the training data generator chooses
Unlike Peters et al. (2018) and Radford et al. 15% of tokens at random, e.g., in the sentence my
(2018), we do not use traditional left-to-right or dog is hairy it chooses hairy. It then performs
right-to-left language models to pre-train BERT. the following procedure:
Instead, we pre-train BERT using two novel unsu-
• Rather than always replacing the chosen
pervised prediction tasks, described in this section.
words with [MASK], the data generator will
3.3.1 Task #1: Masked LM do the following:
Intuitively, it is reasonable to believe that a
• 80% of the time: Replace the word with the
deep bidirectional model is strictly more power-
[MASK] token, e.g., my dog is hairy →
ful than either a left-to-right model or the shal-
my dog is [MASK]
low concatenation of a left-to-right and right-to-
left model. Unfortunately, standard conditional • 10% of the time: Replace the word with a
language models can only be trained left-to-right random word, e.g., my dog is hairy → my
or right-to-left, since bidirectional conditioning dog is apple
would allow each word to indirectly “see itself”
in a multi-layered context. • 10% of the time: Keep the word un-
In order to train a deep bidirectional representa- changed, e.g., my dog is hairy → my dog
tion, we take a straightforward approach of mask- is hairy. The purpose of this is to bias the
ing some percentage of the input tokens at random, representation towards the actual observed
and then predicting only those masked tokens. We word.
The Transformer encoder does not know which For the pre-training corpus we use the concatena-
words it will be asked to predict or which have tion of BooksCorpus (800M words) (Zhu et al.,
been replaced by random words, so it is forced to 2015) and English Wikipedia (2,500M words).
keep a distributional contextual representation of For Wikipedia we extract only the text passages
every input token. Additionally, because random and ignore lists, tables, and headers. It is criti-
replacement only occurs for 1.5% of all tokens cal to use a document-level corpus rather than a
(i.e., 10% of 15%), this does not seem to harm the shuffled sentence-level corpus such as the Billion
model’s language understanding capability. Word Benchmark (Chelba et al., 2013) in order to
The second downside of using an MLM is that extract long contiguous sequences.
only 15% of tokens are predicted in each batch, To generate each training input sequence, we
which suggests that more pre-training steps may sample two spans of text from the corpus, which
be required for the model to converge. In Sec- we refer to as “sentences” even though they are
tion 5.3 we demonstrate that MLM does con- typically much longer than single sentences (but
verge marginally slower than a left-to-right model can be shorter also). The first sentence receives
(which predicts every token), but the empirical im- the A embedding and the second receives the B
provements of the MLM model far outweigh the embedding. 50% of the time B is the actual next
increased training cost. sentence that follows A and 50% of the time it is
a random sentence, which is done for the “next
3.3.2 Task #2: Next Sentence Prediction
sentence prediction” task. They are sampled such
Many important downstream tasks such as Ques- that the combined length is ≤ 512 tokens. The
tion Answering (QA) and Natural Language In- LM masking is applied after WordPiece tokeniza-
ference (NLI) are based on understanding the re- tion with a uniform masking rate of 15%, and no
lationship between two text sentences, which is special consideration given to partial word pieces.
not directly captured by language modeling. In We train with batch size of 256 sequences (256
order to train a model that understands sentence sequences * 512 tokens = 128,000 tokens/batch)
relationships, we pre-train a binarized next sen- for 1,000,000 steps, which is approximately 40
tence prediction task that can be trivially gener- epochs over the 3.3 billion word corpus. We
ated from any monolingual corpus. Specifically, use Adam with learning rate of 1e-4, β1 = 0.9,
when choosing the sentences A and B for each pre- β2 = 0.999, L2 weight decay of 0.01, learning
training example, 50% of the time B is the actual rate warmup over the first 10,000 steps, and linear
next sentence that follows A, and 50% of the time decay of the learning rate. We use a dropout prob-
it is a random sentence from the corpus. For ex- ability of 0.1 on all layers. We use a gelu acti-
ample: vation (Hendrycks and Gimpel, 2016) rather than
Input = [CLS] the man went to [MASK] store [SEP] the standard relu, following OpenAI GPT. The
he bought a gallon [MASK] milk [SEP]
training loss is the sum of the mean masked LM
likelihood and mean next sentence prediction like-
Label = IsNext lihood.
Training of BERTBASE was performed on 4
Input = [CLS] the man [MASK] to the store [SEP] Cloud TPUs in Pod configuration (16 TPU chips
penguin [MASK] are flight ##less birds [SEP] total).5 Training of BERTLARGE was performed
on 16 Cloud TPUs (64 TPU chips total). Each pre-
Label = NotNext
training took 4 days to complete.
We choose the NotNext sentences completely at
random, and the final pre-trained model achieves 3.5 Fine-tuning Procedure
97%-98% accuracy at this task. Despite its sim- For sequence-level classification tasks, BERT
plicity, we demonstrate in Section 5.1 that pre- fine-tuning is straightforward. In order to obtain
training towards this task is very beneficial to both a fixed-dimensional pooled representation of the
QA and NLI. input sequence, we take the final hidden state (i.e.,
the output of the Transformer) for the first token
3.4 Pre-training Procedure
5
https://cloudplatform.googleblog.com/2018/06/Cloud-
The pre-training procedure largely follows the ex- TPU-now-offers-preemptible-pricing-and-global-
isting literature on language model pre-training. availability.html
in the input, which by construction corresponds to • GPT uses a sentence separator ([SEP]) and
the the special [CLS] word embedding. We de- classifier token ([CLS]) which are only in-
note this vector as C ∈ RH . The only new pa- troduced at fine-tuning time; BERT learns
rameters added during fine-tuning are for a classi- [SEP], [CLS] and sentence A/B embed-
fication layer W ∈ RK×H , where K is the num- dings during pre-training.
ber of classifier labels. The label probabilities
P ∈ RK are computed with a standard softmax, • GPT was trained for 1M steps with a batch
P = softmax(CW T ). All of the parameters of size of 32,000 words; BERT was trained for
BERT and W are fine-tuned jointly to maximize 1M steps with a batch size of 128,000 words.
the log-probability of the correct label. For span-
• GPT used the same learning rate of 5e-5 for
level and token-level prediction tasks, the above
all fine-tuning experiments; BERT chooses a
procedure must be modified slightly in a task-
task-specific fine-tuning learning rate which
specific manner. Details are given in the corre-
performs the best on the development set.
sponding subsection of Section 4.
For fine-tuning, most model hyperparameters To isolate the effect of these differences, we per-
are the same as in pre-training, with the excep- form ablation experiments in Section 5.1 which
tion of the batch size, learning rate, and number demonstrate that the majority of the improvements
of training epochs. The dropout probability was are in fact coming from the new pre-training tasks.
always kept at 0.1. The optimal hyperparameter
values are task-specific, but we found the follow- 4 Experiments
ing range of possible values to work well across
all tasks: In this section, we present BERT fine-tuning re-
sults on 11 NLP tasks.
• Batch size: 16, 32
• Learning rate (Adam): 5e-5, 3e-5, 2e-5 4.1 GLUE Datasets
• Number of epochs: 3, 4 The General Language Understanding Evaluation
(GLUE) benchmark (Wang et al., 2018) is a col-
We also observed that large data sets (e.g., lection of diverse natural language understand-
100k+ labeled training examples) were far less ing tasks. Most of the GLUE datasets have al-
sensitive to hyperparameter choice than small data ready existed for a number of years, but the pur-
sets. Fine-tuning is typically very fast, so it is rea- pose of GLUE is to (1) distribute these datasets
sonable to simply run an exhaustive search over with canonical Train, Dev, and Test splits, and
the above parameters and choose the model that (2) set up an evaluation server to mitigate issues
performs best on the development set. with evaluation inconsistencies and Test set over-
3.6 Comparison of BERT and OpenAI GPT fitting. GLUE does not distribute labels for the
Test set and users must upload their predictions to
The most comparable existing pre-training method
the GLUE server for evaluation, with limits on the
to BERT is OpenAI GPT, which trains a left-to-
number of submissions.
right Transformer LM on a large text corpus. In
The GLUE benchmark includes the following
fact, many of the design decisions in BERT were
datasets, the descriptions of which were originally
intentionally chosen to be as close to GPT as pos-
summarized in Wang et al. (2018):
sible so that the two methods could be minimally
compared. The core argument of this work is that MNLI Multi-Genre Natural Language Inference
the two novel pre-training tasks presented in Sec- is a large-scale, crowdsourced entailment classifi-
tion 3.3 account for the majority of the empiri- cation task (Williams et al., 2018). Given a pair of
cal improvements, but we do note that there are sentences, the goal is to predict whether the sec-
several other differences between how BERT and ond sentence is an entailment, contradiction, or
GPT were trained: neutral with respect to the first one.
• GPT is trained on the BooksCorpus (800M QQP Quora Question Pairs is a binary classifi-
words); BERT is trained on the BooksCor- cation task where the goal is to determine if two
pus (800M words) and Wikipedia (2,500M questions asked on Quora are semantically equiv-
words). alent (Chen et al., 2018).
Class Class
Label Label
BERT BERT
[CLS]
Tok
... Tok
[SEP]
Tok
... Tok [CLS] Tok 1 Tok 2 ... Tok N
1 N 1 M
BERT BERT
Figure 3: Our task specific models are formed by incorporating BERT with one additional output layer, so a
minimal number of parameters need to be learned from scratch. Among the tasks, (a) and (b) are sequence-level
tasks while (c) and (d) are token-level tasks. In the figure, E represents the input embedding, Ti represents the
contextual representation of token i, [CLS] is the special symbol for classification output, and [SEP] is the special
symbol to separate non-consecutive token sequences.
QNLI Question Natural Language Inference is the goal is to predict whether an English sentence
a version of the Stanford Question Answering is linguistically “acceptable” or not (Warstadt
Dataset (Rajpurkar et al., 2016) which has been et al., 2018).
converted to a binary classification task (Wang
et al., 2018). The positive examples are (ques-
tion, sentence) pairs which do contain the correct STS-B The Semantic Textual Similarity Bench-
answer, and the negative examples are (question, mark is a collection of sentence pairs drawn from
sentence) from the same paragraph which do not news headlines and other sources (Cer et al.,
contain the answer. 2017). They were annotated with a score from 1
to 5 denoting how similar the two sentences are in
SST-2 The Stanford Sentiment Treebank is a terms of semantic meaning.
binary single-sentence classification task consist-
ing of sentences extracted from movie reviews
with human annotations of their sentiment (Socher MRPC Microsoft Research Paraphrase Corpus
et al., 2013). consists of sentence pairs automatically extracted
from online news sources, with human annotations
CoLA The Corpus of Linguistic Acceptability is for whether the sentences in the pair are semanti-
a binary single-sentence classification task, where cally equivalent (Dolan and Brockett, 2005).
System MNLI-(m/mm) QQP QNLI SST-2 CoLA STS-B MRPC RTE Average
392k 363k 108k 67k 8.5k 5.7k 3.5k 2.5k -
Pre-OpenAI SOTA 80.6/80.1 66.1 82.3 93.2 35.0 81.0 86.0 61.7 74.0
BiLSTM+ELMo+Attn 76.4/76.1 64.8 79.9 90.4 36.0 73.3 84.9 56.8 71.0
OpenAI GPT 82.1/81.4 70.3 88.1 91.3 45.4 80.0 82.3 56.0 75.2
BERTBASE 84.6/83.4 71.2 90.1 93.5 52.1 85.8 88.9 66.4 79.6
BERTLARGE 86.7/85.9 72.1 91.1 94.9 60.5 86.5 89.3 70.1 81.9
Table 1: GLUE Test results, scored by the GLUE evaluation server. The number below each task denotes the
number of training examples. The “Average” column is slightly different than the official GLUE score, since
we exclude the problematic WNLI set. OpenAI GPT = (L=12, H=768, A=12); BERTBASE = (L=12, H=768,
A=12); BERTLARGE = (L=24, H=1024, A=16). BERT and OpenAI GPT are single-model, single task. All
results obtained from https://gluebenchmark.com/leaderboard and https://blog.openai.
com/language-unsupervised/.
RTE Recognizing Textual Entailment is a bi- small data sets (i.e., some runs would produce de-
nary entailment task similar to MNLI, but with generate results), so we ran several random restarts
much less training data (Bentivogli et al., 2009).6 and selected the model that performed best on the
Dev set. With random restarts, we use the same
WNLI Winograd NLI is a small natural lan-
pre-trained checkpoint but perform different fine-
guage inference dataset deriving from (Levesque
tuning data shuffling and classifier layer initializa-
et al., 2011). The GLUE webpage notes that there
tion. We note that the GLUE data set distribution
are issues with the construction of this dataset, 7
does not include the Test labels, and we only made
and every trained system that’s been submitted
a single GLUE evaluation server submission for
to GLUE has has performed worse than the 65.1
each BERTBASE and BERTLARGE .
baseline accuracy of predicting the majority class.
Results are presented in Table 1. Both
We therefore exclude this set out of fairness to
BERTBASE and BERTLARGE outperform all exist-
OpenAI GPT. For our GLUE submission, we al-
ing systems on all tasks by a substantial margin,
ways predicted the majority class.
obtaining 4.4% and 6.7% respective average accu-
4.1.1 GLUE Results racy improvement over the state-of-the-art. Note
To fine-tune on GLUE, we represent the input se- that BERTBASE and OpenAI GPT are nearly iden-
quence or sequence pair as described in Section 3, tical in terms of model architecture outside of
and use the final hidden vector C ∈ RH corre- the attention masking. For the largest and most
sponding to the first input token ([CLS]) as the widely reported GLUE task, MNLI, BERT ob-
aggregate representation. This is demonstrated vi- tains a 4.7% absolute accuracy improvement over
sually in Figure 3 (a) and (b). The only new pa- the state-of-the-art. On the official GLUE leader-
rameters introduced during fine-tuning is a classi- board,8 BERTLARGE obtains a score of 80.4, com-
fication layer W ∈ RK×H , where K is the num- pared to the top leaderboard system, OpenAI GPT,
ber of labels. We compute a standard classification which obtains 72.8 as of the date of writing.
loss with C and W , i.e., log(softmax(CW T )). It is interesting to observe that BERTLARGE sig-
We use a batch size of 32 and 3 epochs over nificantly outperforms BERTBASE across all tasks,
the data for all GLUE tasks. For each task, we ran even those with very little training data. The effect
fine-tunings with learning rates of 5e-5, 4e-5, 3e-5, of BERT model size is explored more thoroughly
and 2e-5 and selected the one that performed best in Section 5.2.
on the Dev set. Additionally, for BERTLARGE we
4.2 SQuAD v1.1
found that fine-tuning was sometimes unstable on
The Standford Question Answering Dataset
6
Note that we only report single-task fine-tuning results in (SQuAD) is a collection of 100k crowdsourced
this paper. Multitask fine-tuning approach could potentially
push the results even further. For example, we did observe question/answer pairs (Rajpurkar et al., 2016).
substantial improvements on RTE from multi-task training Given a question and a paragraph from Wikipedia
with MNLI.
7 8
https://gluebenchmark.com/faq https://gluebenchmark.com/leaderboard
containing the answer, the task is to predict the an- System Dev Test
swer text span in the paragraph. For example: EM F1 EM F1
Leaderboard (Oct 8th, 2018)
• Input Question: Human - - 82.3 91.2
#1 Ensemble - nlnet - - 86.0 91.7
Where do water droplets collide with ice #2 Ensemble - QANet - - 84.5 90.5
crystals to form precipitation? #1 Single - nlnet - - 83.5 90.1
#2 Single - QANet - - 82.5 89.3
• Input Paragraph: Published
... Precipitation forms as smaller droplets BiDAF+ELMo (Single) - 85.8 - -
R.M. Reader (Single) 78.9 86.3 79.5 86.6
coalesce via collision with other rain drops
R.M. Reader (Ensemble) 81.2 87.9 82.3 88.5
or ice crystals within a cloud. ...
Ours
BERTBASE (Single) 80.8 88.5 - -
• Output Answer: BERTLARGE (Single) 84.1 90.9 - -
within a cloud BERTLARGE (Ensemble) 85.8 91.8 - -
BERTLARGE (Sgl.+TriviaQA) 84.2 91.1 85.1 91.8
BERTLARGE (Ens.+TriviaQA) 86.2 92.2 87.4 93.2
This type of span prediction task is quite dif-
ferent from the sequence classification tasks of
Table 2: SQuAD results. The BERT ensemble is 7x
GLUE, but we are able to adapt BERT to run
systems which use different pre-training checkpoints
on SQuAD in a straightforward manner. Just as and fine-tuning seeds.
with GLUE, we represent the input question and
paragraph as a single packed sequence, with the
question using the A embedding and the paragraph from the SQuAD leaderboard do not have up-to-
using the B embedding. The only new parame- date public system descriptions available, and are
ters learned during fine-tuning are a start vector allowed to use any public data when training their
S ∈ RH and an end vector E ∈ RH . Let the final systems. We therefore use very modest data aug-
hidden vector from BERT for the ith input token mentation in our submitted system by jointly train-
be denoted as Ti ∈ RH . See Figure 3 (c) for a vi- ing on SQuAD and TriviaQA (Joshi et al., 2017).
sualization. Then, the probability of word i being Our best performing system outperforms the top
the start of the answer span is computed as a dot leaderboard system by +1.5 F1 in ensembling and
product between Ti and S followed by a softmax +1.3 F1 as a single system. In fact, our single
over all of the words in the paragraph: BERT model outperforms the top ensemble sys-
tem in terms of F1 score. If we fine-tune on only
eS·Ti SQuAD (without TriviaQA) we lose 0.1-0.4 F1
Pi = P S·T
je
j
and still outperform all existing systems by a wide
margin.
The same formula is used for the end of the an-
swer span, and the maximum scoring span is used
4.3 Named Entity Recognition
as the prediction. The training objective is the log-
likelihood of the correct start and end positions. To evaluate performance on a token tagging task,
We train for 3 epochs with a learning rate of 5e- we fine-tune BERT on the CoNLL 2003 Named
5 and a batch size of 32. At inference time, since Entity Recognition (NER) dataset. This dataset
the end prediction is not conditioned on the start, consists of 200k training words which have been
we add the constraint that the end must come after annotated as Person, Organization, Location,
the start, but no other heuristics are used. The tok- Miscellaneous, or Other (non-named entity).
enized labeled span is aligned back to the original For fine-tuning, we feed the final hidden
untokenized input for evaluation. representation Ti ∈ RH for to each token i into
Results are presented in Table 2. SQuAD uses a classification layer over the NER label set. The
a highly rigorous testing procedure where the sub- predictions are not conditioned on the surround-
mitter must manually contact the SQuAD organiz- ing predictions (i.e., non-autoregressive and no
ers to run their system on a hidden test set, so we CRF). To make this compatible with WordPiece
only submitted our best system for testing. The tokenization, we feed each CoNLL-tokenized
result shown in the table is our first and only Test input word into our WordPiece tokenizer and
submission to SQuAD. We note that the top results use the hidden state corresponding to the first
System Dev F1 Test F1 System Dev Test
ELMo+BiLSTM+CRF 95.7 92.2 ESIM+GloVe 51.9 52.7
CVT+Multi (Clark et al., 2018) - 92.6 ESIM+ELMo 59.1 59.2
BERTBASE 96.4 92.4 BERTBASE 81.6 -
BERTLARGE 96.6 92.8 BERTLARGE 86.6 86.3
Human (expert)† - 85.0
Table 3: CoNLL-2003 Named Entity Recognition re-
sults. The hyperparameters were selected using the Human (5 annotations)† - 88.0
Dev set, and the reported Dev and Test scores are aver-
aged over 5 random restarts using those hyperparame- Table 4: SWAG Dev and Test accuracies. Test results
ters. were scored against the hidden labels by the SWAG au-
thors. † Human performance is measure with 100 sam-
ples, as reported in the SWAG paper.
sub-token as input to the classifier. For example:
BERTBASE 84.4 88.4 86.7 92.7 88.5 #L #H #A LM (ppl) MNLI-m MRPC SST-2
No NSP 83.9 84.9 86.5 92.6 87.9 3 768 12 5.84 77.9 79.8 88.4
LTR & No NSP 82.1 84.3 77.5 92.1 77.8 6 768 3 5.24 80.6 82.2 90.7
+ BiLSTM 82.1 84.1 75.7 91.6 84.9 6 768 12 4.68 81.9 84.8 91.3
12 768 12 3.99 84.4 86.7 92.9
Table 5: Ablation over the pre-training tasks using the 12 1024 16 3.54 85.7 86.9 93.3
24 1024 16 3.23 86.6 87.8 93.7
BERTBASE architecture. “No NSP” is trained without
the next sentence prediction task. “LTR & No NSP” is
trained as a left-to-right LM without the next sentence Table 6: Ablation over BERT model size. #L = the
prediction, like OpenAI GPT. “+ BiLSTM” adds a ran- number of layers; #H = hidden size; #A = number of at-
domly initialized BiLSTM on top of the “LTR + No tention heads. “LM (ppl)” is the masked LM perplexity
NSP” model during fine-tuning. of held-out training data.
contains 110M parameters and BERTLARGE con- 5.4 Feature-based Approach with BERT
tains 340M parameters.
It has been known for many years that increas- All of the BERT results presented so far have used
ing the model size will lead to continual improve- the fine-tuning approach, where a simple classifi-
ments on large-scale tasks such as machine trans- cation layer is added to the pre-trained model, and
lation and language modeling, which is demon- all parameters are jointly fine-tuned on a down-
strated by the LM perplexity of held-out training stream task. However, the feature-based approach,
data shown in Table 6. However, we believe that where fixed features are extracted from the pre-
this is the first work to demonstrate that scaling to trained model, has certain advantages. First, not
extreme model sizes also leads to large improve- all NLP tasks can be easily be represented by a
ments on very small scale tasks, provided that the Transformer encoder architecture, and therefore
model has been sufficiently pre-trained. require a task-specific model architecture to be
5.3 Effect of Number of Training Steps added. Second, there are major computational
benefits to being able to pre-compute an expensive
Figure 4 presents MNLI Dev accuracy after fine-
representation of the training data once and then
tuning from a checkpoint that has been pre-trained
run many experiments with less expensive models
for k steps. This allows us to answer the following
on top of this representation.
questions:
In this section we evaluate how well BERT per-
1. Question: Does BERT really need such forms in the feature-based approach by generating
a large amount of pre-training (128,000 ELMo-like pre-trained contextual representations
words/batch * 1,000,000 steps) to achieve on the CoNLL-2003 NER task. To do this, we use
high fine-tuning accuracy? the same input representation as in Section 4.3, but
Answer: Yes, BERTBASE achieves almost use the activations from one or more layers with-
1.0% additional accuracy on MNLI when out fine-tuning any parameters of BERT. These
trained on 1M steps compared to 500k steps. contextual embeddings are used as input to a ran-
domly initialized two-layer 768-dimensional BiL-
2. Question: Does MLM pre-training converge
STM before the classification layer.
slower than LTR pre-training, since only 15%
of words are predicted in each batch rather Results are shown in Table 7. The best perform-
than every word? ing method is to concatenate the token representa-
Answer: The MLM model does converge tions from the top four hidden layers of the pre-
slightly slower than the LTR model. How- trained Transformer, which is only 0.3 F1 behind
ever, in terms of absolute accuracy the MLM fine-tuning the entire model. This demonstrates
model begins to outperform the LTR model that BERT is effective for both the fine-tuning and
almost immediately. feature-based approaches.
84
Layers Dev F1
MNLI Dev Accuracy
82
Finetune All 96.4
80 First Layer (Embeddings) 91.0
Second-to-Last Hidden 95.6
78
Last Hidden 94.9
BERTBASE (Masked LM)
76 BERTBASE (Left-to-Right)
Sum Last Four Hidden 95.9
200 400 600 800 1,000
Concat Last Four Hidden 96.1
Pre-training Steps (Thousands) Sum All 12 Layers 95.5
Figure 4: Ablation over number of training steps. This Table 7: Ablation using BERT with a feature-based ap-
shows the MNLI accuracy after fine-tuning, starting proach on CoNLL-2003 NER. The activations from the
from model parameters that have been pre-trained for specified layers are combined and fed into a two-layer
k steps. The x-axis is the value of k. BiLSTM, without backpropagation to BERT.
6 Conclusion ing progress in statistical language modeling. arXiv
preprint arXiv:1312.3005.
Recent empirical improvements due to transfer
learning with language models have demonstrated Z. Chen, H. Zhang, X. Zhang, and L. Zhao. 2018.
Quora question pairs.
that rich, unsupervised pre-training is an integral
part of many language understanding systems. In Kevin Clark, Minh-Thang Luong, Christopher D Man-
particular, these results enable even low-resource ning, and Quoc V Le. 2018. Semi-supervised se-
quence modeling with cross-view training. arXiv
tasks to benefit from very deep unidirectional ar- preprint arXiv:1809.08370.
chitectures. Our major contribution is further gen-
eralizing these findings to deep bidirectional ar- Ronan Collobert and Jason Weston. 2008. A unified
architecture for natural language processing: Deep
chitectures, allowing the same pre-trained model neural networks with multitask learning. In Pro-
to successfully tackle a broad set of NLP tasks. ceedings of the 25th International Conference on
While the empirical results are strong, in some Machine Learning, ICML ’08.
cases surpassing human performance, important
Alexis Conneau, Douwe Kiela, Holger Schwenk, Loı̈c
future work is to investigate the linguistic phenom- Barrault, and Antoine Bordes. 2017. Supervised
ena that may or may not be captured by BERT. learning of universal sentence representations from
natural language inference data. In Proceedings of
the 2017 Conference on Empirical Methods in Nat-
References ural Language Processing, pages 670–680, Copen-
hagen, Denmark. Association for Computational
Rami Al-Rfou, Dokook Choe, Noah Constant, Mandy Linguistics.
Guo, and Llion Jones. 2018. Character-level lan-
guage modeling with deeper self-attention. arXiv Andrew M Dai and Quoc V Le. 2015. Semi-supervised
preprint arXiv:1808.04444. sequence learning. In Advances in neural informa-
tion processing systems, pages 3079–3087.
Rie Kubota Ando and Tong Zhang. 2005. A framework
for learning predictive structures from multiple tasks J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-
and unlabeled data. Journal of Machine Learning Fei. 2009. ImageNet: A Large-Scale Hierarchical
Research, 6(Nov):1817–1853. Image Database. In CVPR09.
Luisa Bentivogli, Bernardo Magnini, Ido Dagan, William B Dolan and Chris Brockett. 2005. Automati-
Hoa Trang Dang, and Danilo Giampiccolo. 2009. cally constructing a corpus of sentential paraphrases.
The fifth PASCAL recognizing textual entailment In Proceedings of the Third International Workshop
challenge. In TAC. NIST. on Paraphrasing (IWP2005).
John Blitzer, Ryan McDonald, and Fernando Pereira. Dan Hendrycks and Kevin Gimpel. 2016. Bridging
2006. Domain adaptation with structural correspon- nonlinearities and stochastic regularizers with gaus-
dence learning. In Proceedings of the 2006 confer- sian error linear units. CoRR, abs/1606.08415.
ence on empirical methods in natural language pro- Jeremy Howard and Sebastian Ruder. 2018. Universal
cessing, pages 120–128. Association for Computa- language model fine-tuning for text classification. In
tional Linguistics. ACL. Association for Computational Linguistics.
Samuel R. Bowman, Gabor Angeli, Christopher Potts, Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke
and Christopher D. Manning. 2015. A large anno- Zettlemoyer. 2017. Triviaqa: A large scale distantly
tated corpus for learning natural language inference. supervised challenge dataset for reading comprehen-
In EMNLP. Association for Computational Linguis- sion. In ACL.
tics.
Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov,
Peter F Brown, Peter V Desouza, Robert L Mercer, Richard Zemel, Raquel Urtasun, Antonio Torralba,
Vincent J Della Pietra, and Jenifer C Lai. 1992. and Sanja Fidler. 2015. Skip-thought vectors. In
Class-based n-gram models of natural language. Advances in neural information processing systems,
Computational linguistics, 18(4):467–479. pages 3294–3302.
Daniel Cer, Mona Diab, Eneko Agirre, Inigo Lopez- Quoc Le and Tomas Mikolov. 2014. Distributed rep-
Gazpio, and Lucia Specia. 2017. Semeval-2017 resentations of sentences and documents. In Inter-
task 1: Semantic textual similarity-multilingual and national Conference on Machine Learning, pages
cross-lingual focused evaluation. arXiv preprint 1188–1196.
arXiv:1708.00055.
Hector J Levesque, Ernest Davis, and Leora Morgen-
Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, stern. 2011. The winograd schema challenge. In
Thorsten Brants, Phillipp Koehn, and Tony Robin- Aaai spring symposium: Logical formalizations of
son. 2013. One billion word benchmark for measur- commonsense reasoning, volume 46, page 47.
Lajanugen Logeswaran and Honglak Lee. 2018. An Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob
efficient framework for learning sentence represen- Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz
tations. In International Conference on Learning Kaiser, and Illia Polosukhin. 2017. Attention is all
Representations. you need. In Advances in Neural Information Pro-
cessing Systems, pages 6000–6010.
Bryan McCann, James Bradbury, Caiming Xiong, and
Richard Socher. 2017. Learned in translation: Con- Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and
textualized word vectors. In NIPS. Pierre-Antoine Manzagol. 2008. Extracting and
composing robust features with denoising autoen-
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Cor- coders. In Proceedings of the 25th international
rado, and Jeff Dean. 2013. Distributed representa- conference on Machine learning, pages 1096–1103.
tions of words and phrases and their compositional- ACM.
ity. In Advances in Neural Information Processing
Systems 26, pages 3111–3119. Curran Associates, Alex Wang, Amapreet Singh, Julian Michael, Felix
Inc. Hill, Omer Levy, and Samuel R Bowman. 2018.
Glue: A multi-task benchmark and analysis platform
Jeffrey Pennington, Richard Socher, and Christo-
for natural language understanding. arXiv preprint
pher D. Manning. 2014. Glove: Global vectors for
arXiv:1804.07461.
word representation. In Empirical Methods in Nat-
ural Language Processing (EMNLP), pages 1532– A. Warstadt, A. Singh, and S. R. Bowman. 2018. Cor-
1543. pus of linguistic acceptability.
Matthew Peters, Waleed Ammar, Chandra Bhagavat- Adina Williams, Nikita Nangia, and Samuel R Bow-
ula, and Russell Power. 2017. Semi-supervised se- man. 2018. A broad-coverage challenge corpus
quence tagging with bidirectional language models. for sentence understanding through inference. In
In ACL. NAACL.
Matthew Peters, Mark Neumann, Mohit Iyyer, Matt
Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V
Gardner, Christopher Clark, Kenton Lee, and Luke
Le, Mohammad Norouzi, Wolfgang Macherey,
Zettlemoyer. 2018. Deep contextualized word rep-
Maxim Krikun, Yuan Cao, Qin Gao, Klaus
resentations. In NAACL.
Macherey, et al. 2016. Google’s neural ma-
Alec Radford, Karthik Narasimhan, Tim Salimans, and chine translation system: Bridging the gap between
Ilya Sutskever. 2018. Improving language under- human and machine translation. arXiv preprint
standing with unsupervised learning. Technical re- arXiv:1609.08144.
port, OpenAI.
Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Lipson. 2014. How transferable are features in deep
Percy Liang. 2016. Squad: 100,000+ questions neural networks? In Advances in neural information
for machine comprehension of text. arXiv preprint processing systems, pages 3320–3328.
arXiv:1606.05250.
Rowan Zellers, Yonatan Bisk, Roy Schwartz, and Yejin
Richard Socher, Alex Perelygin, Jean Wu, Jason Choi. 2018. Swag: A large-scale adversarial dataset
Chuang, Christopher D Manning, Andrew Ng, and for grounded commonsense inference. In Proceed-
Christopher Potts. 2013. Recursive deep models ings of the 2018 Conference on Empirical Methods
for semantic compositionality over a sentiment tree- in Natural Language Processing (EMNLP).
bank. In Proceedings of the 2013 conference on
empirical methods in natural language processing, Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhut-
pages 1631–1642. dinov, Raquel Urtasun, Antonio Torralba, and Sanja
Fidler. 2015. Aligning books and movies: Towards
Wilson L Taylor. 1953. cloze procedure: A new story-like visual explanations by watching movies
tool for measuring readability. Journalism Bulletin, and reading books. In Proceedings of the IEEE
30(4):415–433. international conference on computer vision, pages
19–27.
Erik F Tjong Kim Sang and Fien De Meulder.
2003. Introduction to the conll-2003 shared task:
Language-independent named entity recognition. In
Proceedings of the seventh conference on Natural
language learning at HLT-NAACL 2003-Volume 4,
pages 142–147. Association for Computational Lin-
guistics.
Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010.
Word representations: A simple and general method
for semi-supervised learning. In Proceedings of the
48th Annual Meeting of the Association for Compu-
tational Linguistics, ACL ’10, pages 384–394.