LightLda Class
The LDA transform implements LightLDA, a state-of-the-art implementation of Latent Dirichlet Allocation.
- Inheritance
  - nimbusml.internal.core.feature_extraction.text._lightlda.LightLda → LightLda
  - nimbusml.base_transform.BaseTransform → LightLda
  - sklearn.base.TransformerMixin → LightLda
Constructor
LightLda(num_topic=100, number_of_threads=0, num_max_doc_token=512, alpha_sum=100.0, beta=0.01, mhstep=4, num_iterations=200, likelihood_interval=5, num_summary_term_per_topic=10, num_burnin_iterations=10, reset_random_generator=False, output_topic_word_summary=False, columns=None, **params)
Parameters
Name | Description
---|---
columns | See Columns.
num_topic | The number of topics.
number_of_threads | The number of training threads. The default value depends on the number of logical processors.
num_max_doc_token | The threshold of maximum count of tokens per doc.
alpha_sum | Dirichlet prior on document-topic vectors.
beta | Dirichlet prior on vocab-topic vectors.
mhstep | Number of Metropolis-Hastings steps.
num_iterations | Number of iterations.
likelihood_interval | Compute log likelihood over the local dataset at this iteration interval.
num_summary_term_per_topic | The number of words used to summarize each topic.
num_burnin_iterations | The number of burn-in iterations.
reset_random_generator | Reset the random number generator for each document.
output_topic_word_summary | Whether to output the topic-word summary in text format.
params | Additional arguments sent to the compute engine.
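All of these settings are passed directly to the constructor. The sketch below is not part of the original example; the values are purely illustrative and show how the sampler-related hyperparameters from the table above might be set:

from nimbusml.feature_extraction.text import LightLda

# Illustrative settings only; tune for your own data.
lda = LightLda(
    num_topic=50,               # number of topics to infer
    num_iterations=300,         # sampling iterations
    num_burnin_iterations=20,   # burn-in iterations
    alpha_sum=100.0,            # Dirichlet prior on document-topic vectors
    beta=0.01,                  # Dirichlet prior on vocab-topic vectors
    mhstep=4,                   # Metropolis-Hastings steps
    columns=['review'])         # text column(s) to transform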
Examples
###############################################################################
# LightLda
from nimbusml import FileDataStream, Pipeline
from nimbusml.datasets import get_dataset
from nimbusml.feature_extraction.text import NGramFeaturizer, LightLda
from nimbusml.feature_extraction.text.extractor import Ngram
# data input as a FileDataStream
path = get_dataset('topics').as_filepath()
data = FileDataStream.read_csv(path, sep=",")
print(data.head())
#                                 review                   review_reverse  label
# 0  animals birds cats dogs fish horse   radiation galaxy universe duck      1
# 1    horse birds house fish duck cats  space galaxy universe radiation      0
# 2         car truck driver bus pickup                       bus pickup      1
# transform usage
pipeline = Pipeline(
    [
        NGramFeaturizer(
            word_feature_extractor=Ngram(),
            vector_normalizer='None',
            columns=['review']),
        LightLda(
            num_topic=3,
            columns=['review'])])
# fit and transform
features = pipeline.fit_transform(data)
print(features.head())
#    label  review.0  review.1  review.2                      review_reverse
# 0      1  0.500000  0.333333  0.166667      radiation galaxy universe duck
# 1      0  0.000000  0.166667  0.833333     space galaxy universe radiation
# 2      1  0.400000  0.200000  0.400000                          bus pickup
# 3      0  0.333333  0.333333  0.333333                           car truck
# 4      1  1.000000  0.000000  0.000000   car truck driver bus pickup horse
Remarks
Latent Dirichlet Allocation is a well-known topic modeling algorithm that infers topical structure from text data and can be used to featurize any text field as a low-dimensional topical vector. LightLDA is an extremely efficient implementation of LDA developed at MSR-Asia that incorporates a number of optimization techniques (https://arxiv.org/abs/1412.1576). With the LDA transform, a topic model with 1 million topics and a 1-million-word vocabulary can be trained on a 1-billion-token document set on a single machine in a few hours; LDA at this scale typically takes days and requires large clusters. The most significant innovation is a highly efficient O(1) Metropolis-Hastings sampling algorithm whose running cost is (surprisingly) agnostic of model size, allowing it to converge nearly an order of magnitude faster than other Gibbs samplers.
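To illustrate the featurization described above, the sketch below feeds the low-dimensional topic vectors into a downstream learner. This is not part of the library example: it assumes LogisticRegressionBinaryClassifier from nimbusml.linear_model as the classifier and reuses the 'data' stream from the example above.

from nimbusml import Pipeline
from nimbusml.feature_extraction.text import NGramFeaturizer, LightLda
from nimbusml.feature_extraction.text.extractor import Ngram
from nimbusml.linear_model import LogisticRegressionBinaryClassifier

# Topic vectors produced by LightLda serve as features for a classifier.
pipeline = Pipeline(
    [
        NGramFeaturizer(
            word_feature_extractor=Ngram(),
            vector_normalizer='None',
            columns=['review']),
        LightLda(
            num_topic=3,
            columns=['review']),
        LogisticRegressionBinaryClassifier(
            feature=['review'],
            label='label')])

# model = pipeline.fit(data)  # 'data' as read in the example above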
Methods
Name | Description
---|---
get_params | Get the parameters for this operator.

get_params
Get the parameters for this operator.
get_params(deep=False)

Parameters
Name | Description
---|---
deep | Default value: False
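For example (a minimal sketch, assuming the usual scikit-learn convention that the returned dictionary contains the constructor arguments):

lda = LightLda(num_topic=3)
params = lda.get_params()      # dict of constructor arguments
print(params['num_topic'])     # 3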