Embeddings¶
Embedding, also call vectorization or representation, is using a high demesional vector to represent the meaning of a work/document. The most important topic of modern NLP is how to create a proper representation of a word or text. Let’s see the following examples.
My previous section already discussed some ways to vectorized the documents. Here I would introduce the machine learning methods to make embeddings.
Word embedding¶
Load packages and data
import re
import pandas as pd
from nltk.tokenize import word_tokenize
from gensim.models import Word2Vec
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from TextCleaner import clean_text
## data source - https://www.kaggle.com/bman93/dataset/data#
df = pd.read_csv("data/Top30.csv")
docs = df.Description
len(docs)
72292
Word2Vec
clean and tokenize data
%%time
tokenss = []
for doc in docs:
tokenss.append(clean_text(doc))
Wall time: 2min 30s, processed ~70k docs.
To generate good work vectors, I would suggest at least 1m jobs. This preprocess will take ~30 mins. If even more data, parallel by PySpark is necessary, see my pyspark notes.
Train model
%%time
model = Word2Vec(
tokenss, # list of tokens
size=500, # vector length
window=4, # maximum distance between the current and predicted word
min_count=5, # ignores words with frequency lower than 5
workers=4 # number of threads
)
Wall time: 1min 55s
Check results
test_list = ["python", "javascript", "powerbi", "excel", "git"]
for word in test_list:
print(word, ":\n", model.wv.most_similar(word, topn=5))
python :
[(‘rdbms’, 0.8215682506561279), (‘perl’, 0.8211899399757385), (‘tomcat’, 0.8167069554328918), (‘weblogic’, 0.8105891942977905), (‘jms’, 0.8051434755325317)]
javascript :
[(‘struts’, 0.8272807002067566), (‘xml’, 0.8139920234680176), (‘jquery’, 0.8123090863227844), (‘html’, 0.8114718198776245), (‘xslt’, 0.794143795967102)]
powerbi :
[(‘obiee’, 0.6827813982963562), (‘tableau’, 0.6598259210586548), (‘visualization’, 0.6336531639099121), (‘query’, 0.6239848136901855), (‘iri’, 0.6232795119285583)]
excel :
[(‘ms’, 0.666321873664856), (‘powerpoint’, 0.6542035341262817), (‘macros’, 0.6245964765548706), (‘vlookups’, 0.6167137622833252), (‘microsoft’, 0.6154444217681885)]
git :
[(‘svn’, 0.7295184135437012), (‘weblogic’, 0.6927921772003174), (‘tomcat’, 0.6880820989608765), (‘subversion’, 0.6861574053764343), (‘jms’, 0.6797549724578857)]
Document embedding¶
Doc2Vec
make tagged documents
documents = [TaggedDocument(tokens, [i]) for i, tokens in enumerate(tokenss)]
initialize the model, similar as word2vec
model = Doc2Vec(
vector_size=500,
window=4,
min_count=5,
workers=4
)
build vocabulary
model.build_vocab(documents)
train the model
model.train(documents,
total_examples=model.corpus_count,
epochs=model.epochs)
Reference
- Word2Vec explanation - http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
- Word2Vec paper - 1.https://arxiv.org/pdf/1301.3781.pdf, 2.https://arxiv.org/pdf/1310.4546.pdf
- Doc2Vec paper - https://arxiv.org/pdf/1405.4053v2.pdf
- gensim - https://radimrehurek.com/gensim/models/word2vec.html