Machine learning models - BERT¶
In this chapter, let’s try BERT.
Background¶
BERT (Bidirectional Encoder Representations from Transformers) is a state-of-the-art technique for NLP pre-training developed by Google in 2018. It is the first deeply bidirectional, unsupervised language representation, pre-trained using only a plain text corpus. It achieved state-of-the-art performance on many NLP tasks[1]. Google has already applied it for query searching, claimed that the search improvement by BERT as “one of the biggest leaps forward in the history of Search”[2].
In this chapter, we will continue on text classification task as the previous chapters, to see how BERT can improve the performance.
Codes¶
Full code can be find in the notebooks/BERT-classifier.ipynb. The notebook should run on TPU on Google Colab.
- Load pre-train model
!wget https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip
!unzip uncased_L-12_H-768_A-12.zip
bert_folder = "uncased_L-12_H-768_A-12/"
bert_config_file = os.path.join(bert_folder, "bert_config.json")
bert_ckpt_file = os.path.join(bert_folder, "bert_model.ckpt")
bert_vocab_file = os.path.join(bert_folder, "vocab.txt")
- Preprocessing
tokenizer = FullTokenizer(vocab_file=bert_vocab_file)
max_len = 64
def _convert_single(input_text):
tokens = tokenizer.tokenize(input_text)
tokens = ["[CLS]"] + tokens + ["[SEP]"]
token_ids = tokenizer.convert_tokens_to_ids(tokens)
return token_ids
def _convert_multiple(input_list):
token_ids_list = []
max_len = 1
for sent in tqdm(input_list):
token_ids = _convert_single(sent)
token_ids_list.append(token_ids)
return token_ids_list
def _pad(token_ids_list):
x_padded = []
for input_ids in token_ids_list:
input_ids = input_ids[:min(len(input_ids), max_len - 2)]
input_ids = input_ids + [0] * (max_len - len(input_ids))
x_padded.append(np.array(input_ids))
return np.array(x_padded)
def convert(input_list):
token_ids_list = _convert_multiple(input_list)
out_array = _pad(token_ids_list)
return out_array
X_train = convert(train.sentence)
X_test = convert(test.sentence)
y_train = train.label.map(label_dict).values
y_test = test.label.map(label_dict).values
- Create model
def create_model():
with tf.io.gfile.GFile(bert_config_file, "r") as reader:
bc = StockBertConfig.from_json_string(reader.read())
bert_params = map_stock_config_to_params(bc)
bert_params.adapter_size = None
bert = BertModelLayer.from_params(bert_params, name="bert")
input_ids = tf.keras.layers.Input(
shape=(max_len, ),
dtype='int32',
name="input_ids"
)
bert_output = bert(input_ids)
print("bert shape", bert_output.shape)
cls_out = tf.keras.layers.Lambda(lambda seq: seq[:, 0, :])(bert_output)
cls_out = tf.keras.layers.Dropout(0.5)(cls_out)
logits = tf.keras.layers.Dense(units=768, activation="tanh")(cls_out)
logits = tf.keras.layers.Dropout(0.5)(logits)
logits = tf.keras.layers.Dense(
units=len(classes),
activation="softmax"
)(logits)
model = tf.keras.Model(inputs=input_ids, outputs=logits)
model.build(input_shape=(None, max_len))
load_stock_weights(bert, bert_ckpt_file)
return model
model = create_model()
model.compile(
optimizer=tf.keras.optimizers.Adam(1e-5),
loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
metrics=[tf.keras.metrics.SparseCategoricalAccuracy(name="acc")]
)
Train model
history = model.fit(
x=X_train,
y=y_train,
validation_split=0.2,
batch_size=32,
shuffle=True,
epochs=10,
callbacks=[tensorboard_callback]
)
References:
- BERT paper - https://arxiv.org/pdf/1810.04805.pdf
- Google blog apply BERT for searching - https://www.blog.google/products/search/search-language-understanding-bert/