CharTokenizer Class

A text transform that splits input strings into sequences of characters. It can be applied to data before training a model.

Inheritance
nimbusml.internal.core.preprocessing.text._chartokenizer.CharTokenizer
nimbusml.base_transform.BaseTransform
sklearn.base.TransformerMixin
CharTokenizer

Constructor

CharTokenizer(use_marker_chars=True, columns=None, **params)

Parameters

Name Description
columns

A dictionary of key-value pairs, where each key is an output column name and each value is the corresponding input column name.

  • Multiple key-value pairs are allowed.

  • Input column type: string.

  • Output column type: Vector Type.

  • If the output column names are the same as the input column names, simply specify columns as a list of strings.

The << operator can be used to set this value (see Column Operator).

For example:

  • CharTokenizer(columns={'out1':'input1', 'out2':'input2'})

  • CharTokenizer() << {'out1':'input1', 'out2':'input2'}

For more details, see Columns.

use_marker_chars

Whether to mark the beginning and end of each row/slot with the start of text character (0x02) and the end of text character (0x03), respectively. Default value: True.

params

Additional arguments sent to the compute engine.
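
As a rough illustration of what use_marker_chars controls, the pure-Python sketch below wraps a string with the start of text (0x02) and end of text (0x03) characters before splitting it into characters. The helper tokenize_chars is hypothetical and is not part of nimbusml; the actual transform runs in the compute engine and may differ in detail.

```python
STX = "\x02"  # start of text character (0x02)
ETX = "\x03"  # end of text character (0x03)


def tokenize_chars(text, use_marker_chars=True):
    """Hypothetical helper: split text into single characters.

    When use_marker_chars is True, the result is wrapped with the
    start of text and end of text marker characters.
    """
    chars = list(text)
    if use_marker_chars:
        chars = [STX] + chars + [ETX]
    return chars


# repr() makes the otherwise invisible marker characters visible.
print([repr(c) for c in tokenize_chars("hi")])
print([repr(c) for c in tokenize_chars("hi", use_marker_chars=False)])
```

With the markers enabled, a downstream component can tell where each row starts and ends even after the characters of several rows are concatenated.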

Examples


   ###############################################################################
   # CharTokenizer
   import numpy
   from nimbusml import FileDataStream, DataSchema, Pipeline
   from nimbusml.datasets import get_dataset
   from nimbusml.preprocessing import FromKey
   from nimbusml.preprocessing.text import CharTokenizer
   from nimbusml.preprocessing.schema import ColumnSelector
   from nimbusml.feature_extraction.text import WordEmbedding

   # data input (as a FileDataStream)
   path = get_dataset('wiki_detox_train').as_filepath()

   file_schema = DataSchema.read_schema(
       path, sep='\t', numeric_dtype=numpy.float32)
   data = FileDataStream(path, schema=file_schema)
   print(data.head())

   #    Sentiment                                      SentimentText
   # 0        1.0  ==RUDE== Dude, you are rude upload that carl p...
   # 1        1.0  == OK! ==  IM GOING TO VANDALIZE WILD ONES WIK...
   # 2        1.0  Stop trolling, zapatancas, calling me a liar m...
   # 3        1.0  ==You're cool==  You seem like a really cool g...
   # 4        1.0  ::::: Why are you threatening me? I'm not bein...

   # CharTokenizer outputs a vector of characters encoded as Key type.
   # Use FromKey to convert the keys back to their values first, then
   # pass the result to WordEmbedding.

   pipe = Pipeline([
           CharTokenizer(columns={'SentimentText_Transform': 'SentimentText'}),
           FromKey(columns={'SentimentText_FromKey': 'SentimentText_Transform'}),
           WordEmbedding(model_kind='GloVe50D', columns={'Feature': 'SentimentText_FromKey'}),
           ColumnSelector(columns=['Sentiment', 'SentimentText', 'Feature'])
           ])

   print(pipe.fit_transform(data).head())

   #    Sentiment  ... Feature.149
   # 0        1.0  ...     2.67440
   # 1        1.0  ...     0.78858
   # 2        1.0  ...     2.67440
   # 3        1.0  ...     2.67440
   # 4        1.0  ...     2.67440

   # [5 rows x 152 columns]

Remarks

The CharTokenizer transform is a character-oriented tokenizer where text is considered a sequence of characters.
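
To make "a sequence of characters" concrete, the following plain-Python sketch contrasts character-oriented tokenization with a simple whitespace word split. Both helper functions are hypothetical illustrations and do not reflect how nimbusml implements the transform internally.

```python
def char_tokens(text):
    """Character-oriented view: every character, including spaces, is a token."""
    return list(text)


def word_tokens(text):
    """Word-oriented view, for contrast: split on runs of whitespace."""
    return text.split()


rows = ["Stop trolling", "You're cool"]
for row in rows:
    print("words:", word_tokens(row))
    print("chars:", char_tokens(row))
```

Note that the character view keeps spaces and punctuation as tokens, whereas the word split discards the whitespace between words.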

Methods

get_params

Get the parameters for this operator.

get_params(deep=False)

Parameters

Name Description
deep
Default value: False