BERT (language model)
Introduction
BERT, or Bidirectional Encoder Representations from Transformers, is a transformer-based machine learning technique for natural language processing (NLP). It is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As such, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications.


Background
BERT is a method of pre-training language representations that was created and published in 2018 by researchers at Google. Rather than training a conventional left-to-right generative language model, BERT learns representations by predicting masked words in a text. It is a departure from previous efforts, which processed a text sequence either from left to right or as a shallow combination of left-to-right and right-to-left training. BERT's key technical innovation is applying the bidirectional training of the Transformer, a popular attention-based model, to language modelling.
Technical Details
Architecture
BERT's architecture is a multi-layer bidirectional Transformer encoder based on the original implementation described in the paper "Attention Is All You Need". The paper introduces two model sizes: BERT-base, with 12 layers, a hidden size of 768, 12 attention heads, and roughly 110 million parameters, and BERT-large, with 24 layers, a hidden size of 1024, 16 attention heads, and roughly 340 million parameters.
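These two configurations can be written down concretely with the Hugging Face transformers library; the following is a minimal sketch under that assumption (the reference implementation released with the paper is in TensorFlow), not the authors' own code:

# Minimal sketch: instantiating the two BERT sizes with Hugging Face transformers
# (an assumed third-party library, not the paper's reference implementation).
from transformers import BertConfig, BertModel

base_config = BertConfig(
    num_hidden_layers=12,     # 12 Transformer encoder layers
    hidden_size=768,          # hidden representation size
    num_attention_heads=12,   # attention heads per layer
    intermediate_size=3072,   # feed-forward size, typically 4 x hidden size
)
bert_base = BertModel(base_config)    # randomly initialised, ~110M parameters

large_config = BertConfig(
    num_hidden_layers=24,
    hidden_size=1024,
    num_attention_heads=16,
    intermediate_size=4096,
)
bert_large = BertModel(large_config)  # ~340M parameters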
Training
BERT is pre-trained on a large corpus of unlabeled text comprising the entirety of English Wikipedia and BookCorpus. The model is then fine-tuned for specific tasks. Pre-training uses two objectives: Masked Language Model (MLM) and Next Sentence Prediction (NSP).
In the masked language model task, a fraction of the input words (15% in the original paper) is randomly masked out, and the model must predict the masked words given the context provided by the unmasked ones. In the next sentence prediction task, the model is given pairs of sentences and must predict whether the second sentence of the pair is the sentence that follows the first in the original document.
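The masking step of the MLM objective can be illustrated with a short, self-contained sketch; it assumes a toy whitespace tokenizer and omits details of the actual procedure, which operates on WordPiece sub-word tokens and sometimes keeps or randomly replaces the selected words instead of masking them:

import random

def mask_tokens(tokens, mask_prob=0.15):
    # Hide a random fraction of the tokens; the model is trained to recover them.
    inputs, labels = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            inputs.append("[MASK]")
            labels.append(tok)      # prediction target at this position
        else:
            inputs.append(tok)
            labels.append(None)     # no loss is computed at unmasked positions
    return inputs, labels

masked_input, targets = mask_tokens("the cat sat on the mat".split())
# e.g. masked_input == ["the", "[MASK]", "sat", "on", "the", "mat"], targets[1] == "cat"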
Application
Once pre-training has been completed, BERT can be fine-tuned for specific tasks. These tasks include text classification, named-entity recognition, and question answering, among others. BERT has been used by Google in its search engine to better understand user queries.
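As an illustration, the following is a hedged sketch of fine-tuning BERT for a two-class text classification task using the Hugging Face transformers library; the checkpoint name "bert-base-uncased" and the single training example are illustrative, and a real run would iterate over a labelled dataset for several epochs:

import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# The additional output layer mentioned above: a classification head on top of BERT.
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

inputs = tokenizer("BERT improves search query understanding.", return_tensors="pt")
labels = torch.tensor([1])                 # illustrative label for this single example

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
outputs = model(**inputs, labels=labels)   # forward pass returns the classification loss
outputs.loss.backward()                    # one gradient step of fine-tuning
optimizer.step()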
Advantages and Limitations
BERT's bidirectional approach, which allows it to draw on context from both the left and the right of each word, is a significant improvement over previous models, and it has been shown to improve performance on many NLP tasks. However, BERT is computationally expensive and requires substantial resources to train, which makes it difficult for smaller organizations or individual researchers to train BERT from scratch.