LightLda Class

The LDA transform is based on LightLDA, a state-of-the-art implementation of Latent Dirichlet Allocation.

Inheritance
nimbusml.internal.core.feature_extraction.text._lightlda.LightLda
nimbusml.base_transform.BaseTransform
sklearn.base.TransformerMixin
LightLda

Constructor

LightLda(num_topic=100, number_of_threads=0, num_max_doc_token=512, alpha_sum=100.0, beta=0.01, mhstep=4, num_iterations=200, likelihood_interval=5, num_summary_term_per_topic=10, num_burnin_iterations=10, reset_random_generator=False, output_topic_word_summary=False, columns=None, **params)

Parameters

columns

See Columns.

num_topic

The number of topics.

number_of_threads

The number of training threads. The default depends on the number of logical processors.

num_max_doc_token

The maximum number of tokens counted per document; tokens beyond this threshold are ignored.

alpha_sum

Dirichlet prior on document-topic vectors.

beta

Dirichlet prior on vocab-topic vectors.

mhstep

The number of Metropolis-Hastings steps.

num_iterations

Number of iterations.

likelihood_interval

Compute the log likelihood over the local dataset at this iteration interval.

num_summary_term_per_topic

The number of words used to summarize each topic.

num_burnin_iterations

The number of burn-in iterations.

reset_random_generator

Reset the random number generator for each document.

output_topic_word_summary

Whether to output the topic-word summary in text format.

params

Additional arguments sent to the compute engine.
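
For orientation, here is a minimal construction sketch that overrides a few of these defaults. The values are illustrative rather than tuned recommendations, and the 'review' column matches the example below:

   from nimbusml.feature_extraction.text import LightLda

   # Illustrative settings only: a small 10-topic model with a shorter
   # sampling schedule. alpha_sum and beta are the Dirichlet priors
   # described in the parameter list above.
   lda = LightLda(
       num_topic=10,
       num_iterations=100,
       num_burnin_iterations=5,
       alpha_sum=100.0,
       beta=0.01,
       columns=['review'])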

Examples


   ###############################################################################
   # LightLda
   from nimbusml import FileDataStream, Pipeline
   from nimbusml.datasets import get_dataset
   from nimbusml.feature_extraction.text import NGramFeaturizer, LightLda
   from nimbusml.feature_extraction.text.extractor import Ngram

   # data input as a FileDataStream
   path = get_dataset('topics').as_filepath()
   data = FileDataStream.read_csv(path, sep=",")
   print(data.head())
   #                               review                    review_reverse  label
   # 0  animals birds cats dogs fish horse   radiation galaxy universe duck      1
   # 1    horse birds house fish duck cats  space galaxy universe radiation      0
   # 2         car truck driver bus pickup                       bus pickup      1

   # transform usage
   pipeline = Pipeline(
       [
           NGramFeaturizer(
               word_feature_extractor=Ngram(),
               vector_normalizer='None',
               columns=['review']),
           LightLda(
               num_topic=3,
               columns=['review'])])

   # fit and transform
   features = pipeline.fit_transform(data)
   print(features.head())
   #   label  review.0  review.1  review.2                     review_reverse
   # 0      1  0.500000  0.333333  0.166667     radiation galaxy universe duck
   # 1      0  0.000000  0.166667  0.833333    space galaxy universe radiation
   # 2      1  0.400000  0.200000  0.400000                         bus pickup
   # 3      0  0.333333  0.333333  0.333333                          car truck
   # 4      1  1.000000  0.000000  0.000000  car truck driver bus pickup horse
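
The topic vectors produced by LightLda can also feed a downstream learner. The following sketch assumes the same dataset and pipeline as above and that LogisticRegressionBinaryClassifier from nimbusml.linear_model is available in your build; it is an illustration, not part of this class's API:

   from nimbusml.linear_model import LogisticRegressionBinaryClassifier

   # Illustrative sketch: use the three topic proportions as features
   # for a binary classifier over the 'label' column.
   clf_pipeline = Pipeline(
       [
           NGramFeaturizer(
               word_feature_extractor=Ngram(),
               vector_normalizer='None',
               columns=['review']),
           LightLda(
               num_topic=3,
               columns=['review']),
           LogisticRegressionBinaryClassifier(
               feature=['review'], label='label')])
   clf_pipeline.fit(data)
   scores = clf_pipeline.predict(data)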

Remarks

Latent Dirichlet Allocation is a well-known topic modeling algorithm that infers topical structure from text data and can be used to featurize any text field as a low-dimensional topical vector. LightLDA is an extremely efficient implementation of LDA developed at MSR-Asia that incorporates a number of optimization techniques (https://arxiv.org/abs/1412.1576). With the LDA transform, a topic model with 1 million topics and a 1-million-word vocabulary can be trained over a 1-billion-token document set on a single machine in a few hours; typically, LDA at this scale takes days and requires large clusters. The most significant innovation is a highly efficient O(1) Metropolis-Hastings sampling algorithm whose running cost is, surprisingly, agnostic of model size, allowing it to converge nearly an order of magnitude faster than other Gibbs samplers.
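
For reference, the distribution such samplers draw from is the standard collapsed Gibbs conditional for LDA (standard notation, not taken from the nimbusml source; the alpha_sum and beta parameters above set the corresponding priors):

   p(z_i = k \mid z_{-i}, w) \propto
       \frac{n_{k,-i}^{(w_i)} + \beta}{n_{k,-i}^{(\cdot)} + V\beta}
       \cdot \left( n_{k,-i}^{(d_i)} + \alpha \right)

Here n_{k,-i}^{(w_i)} is the count of word w_i assigned to topic k, n_{k,-i}^{(d_i)} is the count of topic k in document d_i, and V is the vocabulary size, with all counts excluding token i. LightLDA's Metropolis-Hastings proposals approximate a draw from this distribution at O(1) amortized cost per token, which is why the running cost is independent of the number of topics.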

Methods

get_params

Get the parameters for this operator.

get_params(deep=False)

Parameters

deep

Default value: False
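
As a quick usage note, get_params returns the operator's current hyperparameter settings as a dictionary (sklearn-style), for example:

   from nimbusml.feature_extraction.text import LightLda

   # Inspect the configured hyperparameters of an instance.
   lda = LightLda(num_topic=3)
   print(lda.get_params())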