WordEmbedding Class
The WordEmbedding transform is a text featurizer that converts vectors of text tokens into sentence vectors using a pre-trained embedding model.
Note
WordEmbedding requires a column containing a text vector, e.g.
<'This', 'is', 'good'>. Users can create such an input column by:
- concatenating columns with TX type, or
- using the output_tokens_column_name of NGramFeaturizer() to convert a column with sentences like "This is good" into <'This', 'is', 'good'>.
In the following example, after the NGramFeaturizer, features named ngram.* are generated. A new column named ngram_TransformedText is also created with the text vector, similar to running .split(' '). Due to the variable length of this column, it cannot be properly converted to a pandas dataframe, so any pipeline/transform that outputs this text vector column will throw errors. However, when ngram_TransformedText is used as the input to WordEmbedding, the ngram_TransformedText column is overwritten by the output from WordEmbedding. The output from WordEmbedding is named ngram_TransformedText.*
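The text vector described above is essentially a whitespace tokenization of the sentence. A minimal plain-Python sketch of the equivalent preprocessing (illustrative only, not part of nimbusml):

```python
# Illustrative sketch: the text vector that output_tokens_column_name
# produces is similar to splitting the sentence on spaces.
sentence = "This is good"
tokens = sentence.split(' ')
print(tokens)  # ['This', 'is', 'good']
```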
- Inheritance
-
nimbusml.internal.core.feature_extraction.text._wordembedding.WordEmbedding
nimbusml.base_transform.BaseTransform
sklearn.base.TransformerMixin
Constructor
WordEmbedding(model_kind='SentimentSpecificWordEmbedding', custom_lookup_table=None, columns=None, **params)
Parameters
Name | Description |
---|---|
columns
|
A dictionary of key-value pairs, where the key is the output column name and the value is the input column name. The << operator can also be used to set this value (see Column Operator). For more details see Columns. |
model_kind
|
Pre-trained model used to create the vocabulary. Available options are: 'GloVe50D', 'GloVe100D', 'GloVe200D', 'GloVe300D', 'GloVeTwitter25D', 'GloVeTwitter50D', 'GloVeTwitter100D', 'GloVeTwitter200D', 'FastTextWikipedia300D', 'SentimentSpecificWordEmbedding'. |
custom_lookup_table
|
Filename for custom word embedding model. |
params
|
Additional arguments sent to compute engine. |
Examples
###############################################################################
# WordEmbedding: pre-trained DNN model for text.
from nimbusml import FileDataStream, Pipeline
from nimbusml.datasets import get_dataset
from nimbusml.feature_extraction.text import NGramFeaturizer, WordEmbedding
from nimbusml.feature_extraction.text.extractor import Ngram
# data input (as a FileDataStream)
path = get_dataset('wiki_detox_train').as_filepath()
data = FileDataStream.read_csv(path, sep='\t')
print(data.head())
# Sentiment SentimentText
# 0 1 ==RUDE== Dude, you are rude upload that carl p...
# 1 1 == OK! == IM GOING TO VANDALIZE WILD ONES WIK...
# 2 1 Stop trolling, zapatancas, calling me a liar m...
# 3 1 ==You're cool== You seem like a really cool g...
# 4 1 ::::: Why are you threatening me? I'm not bein...
# transform usage
pipeline = Pipeline([
NGramFeaturizer(word_feature_extractor=Ngram(), output_tokens_column_name='ngram_TransformedText',
columns={'ngram': ['SentimentText']}),
WordEmbedding(columns='ngram_TransformedText')
])
# fit and transform
features = pipeline.fit_transform(data)
# print features
print(features.head())
# Sentiment ... ngram.douchiest ngram.award.
# 0 1 ... 0.0 0.0
# 1 1 ... 0.0 0.0
# 2 1 ... 0.0 0.0
# 3 1 ... 0.0 0.0
# 4 1 ... 0.0 0.0
Remarks
The WordEmbedding transform wraps different pre-trained embedding models, such as Sentiment Specific Word Embedding (SSWE). Users can specify which embedding to use; the available options are various versions of the GloVe models, FastText, and SSWE.
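Conceptually, a word-embedding featurizer looks up a fixed vector for each token and pools those vectors into one fixed-length sentence vector. The toy sketch below uses a hypothetical 3-dimensional lookup table and mean pooling purely for illustration; the actual pre-trained models have much higher dimensionality and the transform's pooling strategy may differ:

```python
# Toy sketch (hypothetical made-up vectors, not a real pre-trained model):
# map each token to a fixed vector, then mean-pool into a sentence vector.
lookup = {
    'This': [0.1, 0.2, 0.3],
    'is':   [0.0, 0.1, 0.0],
    'good': [0.5, 0.4, 0.6],
}

def sentence_vector(tokens):
    # Collect the per-token vectors (skipping out-of-vocabulary tokens)
    # and average them component-wise.
    vectors = [lookup[t] for t in tokens if t in lookup]
    dim = len(next(iter(lookup.values())))
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

print(sentence_vector(['This', 'is', 'good']))
```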
Methods
get_params |
Get the parameters for this operator. |
get_params
Get the parameters for this operator.
get_params(deep=False)
Parameters
Name | Description |
---|---|
deep
|
If True, return the parameters of this operator and of its contained subobjects. Default value: False
|