LightLda Class

The LDA transform is based on LightLDA, a state-of-the-art implementation of Latent Dirichlet Allocation.

Inheritance
nimbusml.internal.core.feature_extraction.text._lightlda.LightLda
nimbusml.base_transform.BaseTransform
sklearn.base.TransformerMixin
LightLda

Constructor

LightLda(num_topic=100, number_of_threads=0, num_max_doc_token=512, alpha_sum=100.0, beta=0.01, mhstep=4, num_iterations=200, likelihood_interval=5, num_summary_term_per_topic=10, num_burnin_iterations=10, reset_random_generator=False, output_topic_word_summary=False, columns=None, **params)

Parameters

columns

See Columns.

num_topic

The number of topics.

number_of_threads

The number of training threads. The default depends on the number of logical processors.

num_max_doc_token

The maximum number of tokens counted per document; tokens beyond this threshold are ignored.

alpha_sum

Dirichlet prior on document-topic vectors.

beta

Dirichlet prior on vocab-topic vectors.

mhstep

The number of Metropolis-Hastings steps.

num_iterations

Number of iterations.

likelihood_interval

Compute the log likelihood over the local dataset at this iteration interval.

num_summary_term_per_topic

The number of words used to summarize each topic.

num_burnin_iterations

The number of burn-in iterations.

reset_random_generator

Reset the random number generator for each document.

output_topic_word_summary

Whether to output the topic-word summary in text format.

params

Additional arguments sent to the compute engine.
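
For orientation, here is a minimal construction sketch that overrides a few of these defaults. The values are illustrative rather than tuned recommendations, and the 'review' column matches the example below:

   from nimbusml.feature_extraction.text import LightLda

   # Illustrative settings only: a small 10-topic model with a shorter
   # sampling schedule. alpha_sum and beta are the Dirichlet priors
   # described in the parameter list above.
   lda = LightLda(
       num_topic=10,
       num_iterations=100,
       num_burnin_iterations=5,
       alpha_sum=100.0,
       beta=0.01,
       columns=['review'])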

Examples


   ###############################################################################
   # LightLda
   from nimbusml import FileDataStream, Pipeline
   from nimbusml.datasets import get_dataset
   from nimbusml.feature_extraction.text import NGramFeaturizer, LightLda
   from nimbusml.feature_extraction.text.extractor import Ngram

   # data input as a FileDataStream
   path = get_dataset('topics').as_filepath()
   data = FileDataStream.read_csv(path, sep=",")
   print(data.head())
   #                               review                    review_reverse  label
   # 0  animals birds cats dogs fish horse   radiation galaxy universe duck      1
   # 1    horse birds house fish duck cats  space galaxy universe radiation      0
   # 2         car truck driver bus pickup                       bus pickup      1

   # transform usage
   pipeline = Pipeline(
       [
           NGramFeaturizer(
               word_feature_extractor=Ngram(),
               vector_normalizer='None',
               columns=['review']),
           LightLda(
               num_topic=3,
               columns=['review'])])

   # fit and transform
   features = pipeline.fit_transform(data)
   print(features.head())
   #   label  review.0  review.1  review.2                     review_reverse
   # 0      1  0.500000  0.333333  0.166667     radiation galaxy universe duck
   # 1      0  0.000000  0.166667  0.833333    space galaxy universe radiation
   # 2      1  0.400000  0.200000  0.400000                         bus pickup
   # 3      0  0.333333  0.333333  0.333333                          car truck
   # 4      1  1.000000  0.000000  0.000000  car truck driver bus pickup horse
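
The topic vectors produced by LightLda can also feed a downstream learner. The following sketch assumes the same dataset and pipeline as above and that LogisticRegressionBinaryClassifier from nimbusml.linear_model is available in your build; it is an illustration, not part of this class's API:

   from nimbusml.linear_model import LogisticRegressionBinaryClassifier

   # Illustrative sketch: use the three topic proportions as features
   # for a binary classifier over the 'label' column.
   clf_pipeline = Pipeline(
       [
           NGramFeaturizer(
               word_feature_extractor=Ngram(),
               vector_normalizer='None',
               columns=['review']),
           LightLda(
               num_topic=3,
               columns=['review']),
           LogisticRegressionBinaryClassifier(
               feature=['review'], label='label')])
   clf_pipeline.fit(data)
   scores = clf_pipeline.predict(data)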

Remarks

Latent Dirichlet Allocation is a well-known topic modeling algorithm that infers topical structure from text data and can be used to featurize any text field as a low-dimensional topical vector. LightLDA is an extremely efficient implementation of LDA developed at MSR-Asia that incorporates a number of optimization techniques (https://arxiv.org/abs/1412.1576). With the LDA transform, a topic model with 1 million topics and a 1-million-word vocabulary can be trained over a 1-billion-token document set on a single machine in a few hours; typically, LDA at this scale takes days and requires large clusters. The most significant innovation is a highly efficient O(1) Metropolis-Hastings sampling algorithm whose running cost is, surprisingly, agnostic of model size, allowing it to converge nearly an order of magnitude faster than other Gibbs samplers.
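
For reference, the distribution such samplers draw from is the standard collapsed Gibbs conditional for LDA (standard notation, not taken from the nimbusml source; the alpha_sum and beta parameters above set the corresponding priors):

   p(z_i = k \mid z_{-i}, w) \propto
       \frac{n_{k,-i}^{(w_i)} + \beta}{n_{k,-i}^{(\cdot)} + V\beta}
       \cdot \left( n_{k,-i}^{(d_i)} + \alpha \right)

Here n_{k,-i}^{(w_i)} is the count of word w_i assigned to topic k, n_{k,-i}^{(d_i)} is the count of topic k in document d_i, and V is the vocabulary size, with all counts excluding token i. LightLDA's Metropolis-Hastings proposals approximate a draw from this distribution at O(1) amortized cost per token, which is why the running cost is independent of the number of topics.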

Methods

get_params

Get the parameters for this operator.

get_params(deep=False)

Parameters

deep

Default value: False
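
As a quick usage note, get_params returns the operator's current hyperparameter settings as a dictionary (sklearn-style), for example:

   from nimbusml.feature_extraction.text import LightLda

   # Inspect the configured hyperparameters of an instance.
   lda = LightLda(num_topic=3)
   print(lda.get_params())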