CharTokenizer Class

Reference

Text transforms that can be performed on data before training a model.

Inheritance:

nimbusml.internal.core.preprocessing.text._chartokenizer.CharTokenizer > CharTokenizer

nimbusml.base_transform.BaseTransform > CharTokenizer

sklearn.base.TransformerMixin > CharTokenizer

Constructor

CharTokenizer(use_marker_chars=True, columns=None, **params)

Parameters

Name	Description
columns	A dictionary of key-value pairs, where each key is an output column name and each value is the corresponding input column name. Multiple key-value pairs are allowed. Input column type: string. Output column type: Vector Type. If the output column names are the same as the input column names, specify `columns` as a list of strings instead. The << operator can also be used to set this value (see Column Operator). For example, CharTokenizer(columns={'out1':'input1', 'out2':'input2'}) is equivalent to CharTokenizer() << {'out1':'input1', 'out2':'input2'}. For more details, see Columns.
use_marker_chars	Whether to mark the beginning and end of each row or slot with the start-of-text character (0x02) and the end-of-text character (0x03). Default value: True
params	Additional arguments sent to the compute engine.
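
To make the effect of use_marker_chars concrete, here is a minimal pure-Python sketch of the idea. It does not call NimbusML; the helper name mark_chars is hypothetical, and the library's actual output is a vector of Key type rather than a Python list.

```python
STX = "\x02"  # start-of-text character (0x02)
ETX = "\x03"  # end-of-text character (0x03)


def mark_chars(text, use_marker_chars=True):
    """Hypothetical helper: split text into characters and, if requested,
    wrap the sequence in start-of-text/end-of-text markers."""
    tokens = list(text)
    if use_marker_chars:
        tokens = [STX] + tokens + [ETX]
    return tokens


marked = mark_chars("Hi")
plain = mark_chars("Hi", use_marker_chars=False)

# Show the code points so the non-printable markers are visible.
print([hex(ord(c)) for c in marked])
print(plain)
```

With markers enabled, the sequence gains one extra token at each end, which lets downstream components tell where a row or slot starts and stops.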

Examples


   ###############################################################################
   # CharTokenizer
   import numpy
   from nimbusml import FileDataStream, DataSchema, Pipeline
   from nimbusml.datasets import get_dataset
   from nimbusml.preprocessing import FromKey
   from nimbusml.preprocessing.text import CharTokenizer
   from nimbusml.preprocessing.schema import ColumnSelector
   from nimbusml.feature_extraction.text import WordEmbedding

   # data input (as a FileDataStream)
   path = get_dataset('wiki_detox_train').as_filepath()

   file_schema = DataSchema.read_schema(
       path, sep='\t', numeric_dtype=numpy.float32)
   data = FileDataStream(path, schema=file_schema)
   print(data.head())

   #    Sentiment                                      SentimentText
   # 0        1.0  ==RUDE== Dude, you are rude upload that carl p...
   # 1        1.0  == OK! ==  IM GOING TO VANDALIZE WILD ONES WIK...
   # 2        1.0  Stop trolling, zapatancas, calling me a liar m...
   # 3        1.0  ==You're cool==  You seem like a really cool g...
   # 4        1.0  ::::: Why are you threatening me? I'm not bein...

   # CharTokenizer converts the text into a vector of characters of Key type.
   # Use FromKey to convert the keys back to their original values before
   # passing the result to WordEmbedding.

   pipe = Pipeline([
           CharTokenizer(columns={'SentimentText_Transform': 'SentimentText'}),
           FromKey(columns={'SentimentText_FromKey': 'SentimentText_Transform'}),
           WordEmbedding(model_kind='GloVe50D', columns={'Feature': 'SentimentText_FromKey'}),
           ColumnSelector(columns=['Sentiment', 'SentimentText', 'Feature'])
           ])

   print(pipe.fit_transform(data).head())

   #    Sentiment  ... Feature.149
   # 0        1.0  ...     2.67440
   # 1        1.0  ...     0.78858
   # 2        1.0  ...     2.67440
   # 3        1.0  ...     2.67440
   # 4        1.0  ...     2.67440

   # [5 rows x 152 columns]

Remarks

The CharTokenizer transform is a character-oriented tokenizer where text is considered a sequence of characters.
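
As a rough illustration of what "character-oriented" means, the sketch below contrasts word-level and character-level splitting in plain Python. It is only an analogy under the assumption that every character, including whitespace, becomes its own token; it is not the NimbusML implementation.

```python
text = "Stop trolling"

# Word-oriented view: split on whitespace.
words = text.split()

# Character-oriented view: every character, including the space, is a token.
chars = list(text)

print(words)
print(chars)
print(len(words), len(chars))
```

The character view produces a much longer sequence than the word view for the same input, which is worth keeping in mind when sizing downstream features.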

Methods

get_params

Get the parameters for this operator.

get_params(deep=False)

Parameters

Name	Description
deep	Default value: False
