Chapter 4. Text Vectorization and Transformation Pipelines

Machine learning algorithms operate on a numeric feature space, expecting input as a two-dimensional array where rows are instances and columns are features. In order to perform machine learning on text, we need to transform our documents into vector representations such that we can apply numeric machine learning. This process is called feature extraction or, more simply, vectorization, and is an essential first step toward language-aware analysis.

Representing documents numerically gives us the power to perform meaningful analytics and also creates the instances on which machine learning algorithms operate. In text analysis, instances are entire documents or utterances, which can vary in length from quotes or tweets to entire books, but whose vectors are always of a uniform length. Each property of the vector representation is a feature. For text, features represent attributes and properties of documents—including its content as well as meta attributes, such as document length, author, source, and publication date. When considered together, the features of a document describe a multidimensional feature space on which machine learning methods can be applied.

For this reason, we must now make a critical shift in how we think about language—from a sequence of words to points that occupy a high-dimensional semantic space. Points in space can be close together or far apart, tightly clustered or evenly distributed. Semantic space is therefore mapped in such a way that documents with similar meanings are closer together and those that are different are farther apart. By encoding similarity as distance, we can begin to derive the primary components of documents and draw decision boundaries in our semantic space.

The simplest encoding of semantic space is the bag-of-words model, whose primary insight is that meaning and similarity are encoded in vocabulary. For example, the Wikipedia articles about baseball and Babe Ruth are probably very similar. Not only will many of the same words appear in both, they will not share many words in common with articles about casseroles or quantitative easing. This model, while simple, is extremely effective and forms the starting point for the more complex models we will explore.

In this chapter, we will demonstrate how to use the vectorization process to combine linguistic techniques from NLTK with machine learning techniques in Scikit-Learn and Gensim, creating custom transformers that can be used within repeatable and reusable pipelines. By the end of this chapter, we will be ready to engage our preprocessed corpus, transforming documents to model space so that we can begin making predictions.

Words in Space

To vectorize a corpus with a bag-of-words (BOW) approach, we represent every document from the corpus as a vector whose length is equal to the vocabulary of the corpus. We can simplify the computation by sorting token positions of the vector into alphabetical order, as shown in Figure 4-1. Alternatively, we can keep a dictionary that maps tokens to vector positions. Either way, we arrive at a vector mapping of the corpus that enables us to uniquely represent every document.

Vector encoding is a basic representation of documents.

Figure 4-1. Encoding documents as vectors

What should each element in the document vector be? In the next few sections, we will explore several choices, each of which extends or modifies the base bag-of-words model to describe semantic space. We will look at four types of vector encoding—frequency, one-hot, TF–IDF, and distributed representations—and discuss their implementations in Scikit-Learn, Gensim, and NLTK. We'll operate on a small corpus of the three sentences in the example figures.

To set this up, let's create a list of our documents and tokenize them for the vectorization examples that follow. The tokenize method performs some lightweight normalization, stripping punctuation using the string.punctuation character set and setting the text to lowercase. This function also performs some feature reduction using the SnowballStemmer to remove affixes such as plurality ("bats" and "bat" are the same token). The examples in the next section will use this example corpus and some will employ the tokenization method.

    import nltk
    import string

    def tokenize(text):
        stem = nltk.stem.SnowballStemmer('english')
        text = text.lower()

        for token in nltk.word_tokenize(text):
            if token in string.punctuation:
                continue
            yield stem.stem(token)

    corpus = [
        "The elephant sneezed at the sight of potatoes.",
        "Bats can see via echolocation. See the bat sight sneeze!",
        "Wondering, she opened the door to the studio.",
    ]
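
As a quick sanity check, we can materialize the tokenizer's output for the first document. This is a minimal sketch; the stemmed forms in the comment are approximate and may vary with the NLTK version installed.

    print(list(tokenize(corpus[0])))
    # ['the', 'eleph', 'sneez', 'at', 'the', 'sight', 'of', 'potato']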

The choice of a specific vectorization technique will be largely driven by the problem space. Similarly, our choice of implementation—whether NLTK, Scikit-Learn, or Gensim—should be dictated by the requirements of the application. For instance, NLTK offers many methods that are especially well-suited to text data, but is a big dependency. Scikit-Learn was not designed with text in mind, but does offer a robust API and many other conveniences (which we'll explore later in this chapter) particularly useful in an applied context. Gensim can serialize dictionaries and references in Matrix Market format, making it more flexible for multiple platforms. However, unlike Scikit-Learn, Gensim doesn't do any work on behalf of your documents for tokenization or stemming.

For this reason, as we walk through each of the four approaches to encoding, we'll show a few options for implementation—"With NLTK," "In Scikit-Learn," and "The Gensim Way."

Frequency Vectors

The simplest vector encoding model is to simply fill in the vector with the frequency of each word as it appears in the document. In this encoding scheme, each document is represented as the multiset of the tokens that compose it and the value for each word position in the vector is its count. This representation can either be a straight count (integer) encoding as shown in Figure 4-2 or a normalized encoding where each word is weighted by the total number of words in the document.

Bag of words encoding uses the frequency of words in the document to encode the vector.

Figure 4-2. Token frequency as vector encoding

With NLTK

NLTK expects features as a dict object whose keys are the names of the features and whose values are boolean or numeric. To encode our documents in this way, we'll create a vectorize function that creates a dictionary whose keys are the tokens in the document and whose values are the number of times that token appears in the document.

The defaultdict object allows us to specify what the dictionary will return for a key that hasn't been assigned to it yet. By setting defaultdict(int) we are specifying that a 0 should be returned, thus creating a simple counting dictionary. We can map this function to every item in the corpus using the last line of code, creating an iterable of vectorized documents.

    from collections import defaultdict

    def vectorize(doc):
        features = defaultdict(int)
        for token in tokenize(doc):
            features[token] += 1
        return features

    vectors = map(vectorize, corpus)
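
Materializing the first vector shows the counting dictionary at work. A minimal sketch, with counts for the first example sentence:

    print(next(map(vectorize, corpus)))
    # defaultdict(<class 'int'>, {'the': 2, 'eleph': 1, 'sneez': 1, 'at': 1,
    #                             'sight': 1, 'of': 1, 'potato': 1})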

In Scikit-Learn

The CountVectorizer transformer from the sklearn.feature_extraction module has its own internal tokenization and normalization methods. The fit method of the vectorizer expects an iterable or list of strings or file objects, and creates a dictionary of the vocabulary on the corpus. When transform is called, each individual document is transformed into a sparse array whose index tuple is the row (the document ID) and the token ID from the dictionary, and whose value is the count:

    from sklearn.feature_extraction.text import CountVectorizer

    vectorizer = CountVectorizer()
    vectors = vectorizer.fit_transform(corpus)
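
The fitted vocabulary and the shape of the resulting matrix can be inspected directly; a minimal sketch:

    print(vectors.shape)           # three documents by vocabulary-size columns
    print(vectorizer.vocabulary_)  # maps each token to its column index
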
Note

Vectors can become extremely sparse, particularly as vocabularies get larger, which can have a significant impact on the speed and performance of machine learning models. For very large corpora, it is recommended to use the Scikit-Learn HashingVectorizer, which uses a hashing trick to find the token string name to feature index mapping. This means it uses very low memory and scales to large datasets as it does not need to store the entire vocabulary, and it is faster to pickle and fit since there is no state. However, there is no inverse transform (from vector to text), there can be collisions, and there is no inverse document frequency weighting.
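
For illustration, here is a minimal sketch of the hashing approach; the n_features value is an arbitrary choice that fixes the output dimensionality up front:

    from sklearn.feature_extraction.text import HashingVectorizer

    # No vocabulary is stored; tokens are hashed directly to column indices
    hasher = HashingVectorizer(n_features=2**8)
    vectors = hasher.fit_transform(corpus)
    print(vectors.shape)  # (3, 256)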

The Gensim way

Gensim's frequency encoder is called doc2bow. To use doc2bow, we first create a Gensim Dictionary that maps tokens to indices based on observed order (eliminating the overhead of lexicographic sorting). The dictionary object can be loaded or saved to disk, and implements a doc2bow method that accepts a pretokenized document and returns a sparse matrix of (id, count) tuples where the id is the token's id in the dictionary. Because the doc2bow method only takes a single document instance, we use the list comprehension to restore the entire corpus, loading the tokenized documents into memory so we don't exhaust our generator:

    import gensim

    # Wrap each token generator in list() so the tokens survive multiple passes
    corpus = [list(tokenize(doc)) for doc in corpus]
    id2word = gensim.corpora.Dictionary(corpus)
    vectors = [id2word.doc2bow(doc) for doc in corpus]
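
Each resulting vector contains only the tokens present in that document. A minimal sketch; the exact id assignments depend on the fitted dictionary:

    print(vectors[0])
    # e.g., [(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 2)]
    # (token_id, count) pairs; "the" appears twice in the first sentence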

One-Hot Encoding

Because they disregard grammar and the relative position of words in documents, frequency-based encoding methods suffer from the long tail, or Zipfian distribution, that characterizes natural language. As a result, tokens that occur very frequently are orders of magnitude more "significant" than other, less frequent ones. This can have a significant impact on some models (e.g., generalized linear models) that expect normally distributed features.

A solution to this problem is one-hot encoding, a boolean vector encoding method that marks a particular vector index with a value of true (1) if the token exists in the document and false (0) if it does not. In other words, each element of a one-hot encoded vector reflects either the presence or absence of the token in the described text, as shown in Figure 4-3.

Each element of a one-hot encoded vector reflects the presence or absence of the token in the described text.

Figure 4-3. One-hot encoding

One-hot encoding reduces the imbalance effect of the distribution of tokens, simplifying a document to its constituent components. This reduction is most effective for very small documents (sentences, tweets) that don't contain very many repeated elements, and is usually applied to models that have very good smoothing properties. One-hot encoding is also commonly used in artificial neural networks, whose activation functions require input to be in the discrete range of [0,1] or [-1,1].

With NLTK

The NLTK implementation of one-hot encoding is a dictionary whose keys are tokens and whose value is True:

    def vectorize(doc):
        # Tokenize the raw string so that keys are tokens, not characters
        return {
            token: True
            for token in tokenize(doc)
        }

    vectors = map(vectorize, corpus)

Dictionaries act as simple sparse matrices in the NLTK case because it is not necessary to mark every absent word as False. In addition to the boolean dictionary values, it is also acceptable to use an integer value: 1 for present and 0 for absent.

In Scikit-Learn

In Scikit-Learn, one-hot encoding is implemented with the Binarizer transformer in the preprocessing module. The Binarizer takes only numeric data, so the text data must be transformed into a numeric space using the CountVectorizer ahead of one-hot encoding. The Binarizer class uses a threshold value (0 by default) such that all values of the vector that are less than or equal to the threshold are set to zero, while those that are greater than the threshold are set to 1. Therefore, by default, the Binarizer converts all frequency values to 1 while maintaining the zero-valued frequencies.

    from sklearn.preprocessing import Binarizer

    freq = CountVectorizer()
    corpus = freq.fit_transform(corpus)

    onehot = Binarizer()
    corpus = onehot.fit_transform(corpus.toarray())

The corpus.toarray() method is optional; it converts the sparse matrix representation to a dense one. In corpora with large vocabularies, the sparse matrix representation is much better. Note that we could also use CountVectorizer(binary=True) to achieve one-hot encoding in the above, obviating the Binarizer.
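
That alternative collapses the two steps into one. A minimal sketch, assuming corpus still holds the raw list of strings:

    # binary=True caps every term count at 1, yielding one-hot vectors directly
    onehot = CountVectorizer(binary=True)
    vectors = onehot.fit_transform(corpus)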

Caution

In spite of its name, the OneHotEncoder transformer in the sklearn.preprocessing module is not exactly the right fit for this job. The OneHotEncoder treats each vector component (column) as an independent categorical variable, expanding the dimensionality of the vector for each observed value in each column. In this case, the components (sight, 0) and (sight, 1) would be treated as two categorical dimensions rather than as a single binary encoded vector component.

The Gensim way

While Gensim does not have a specific one-hot encoder, its doc2bow method returns a list of tuples that we can manage on the fly. Extending the code from the Gensim frequency vectorization example in the previous section, we can one-hot encode our vectors with our id2word dictionary. To get our vectors, an inner list comprehension converts the list of tuples returned from the doc2bow method into a list of (token_id, 1) tuples and the outer comprehension applies that converter to all documents in the corpus:

    corpus = [list(tokenize(doc)) for doc in corpus]
    id2word = gensim.corpora.Dictionary(corpus)
    vectors = [
        [(token[0], 1) for token in id2word.doc2bow(doc)]
        for doc in corpus
    ]

One-hot encoding represents similarity and difference at the document level, but because all words are rendered equidistant, it is not able to encode per-word similarity. Moreover, because all words are equally distant, word form becomes incredibly important; the tokens "trying" and "try" will be equally distant from unrelated tokens like "red" or "bicycle"! Normalizing tokens to a single word class, either through stemming or lemmatization, which we'll explore later in this chapter, ensures that different forms of tokens that embed plurality, case, gender, cardinality, tense, etc., are treated as single vector components, reducing the feature space and making models more performant.

Term Frequency–Inverse Document Frequency

The bag-of-words representations that we have explored so far only describe a document in a standalone fashion, not taking into account the context of the corpus. A better approach would be to consider the relative frequency or rareness of tokens in the document against their frequency in other documents. The central insight is that meaning is most likely encoded in the more rare terms from a document. For example, in a corpus of sports text, tokens such as "umpire," "base," and "dugout" appear more frequently in documents that discuss baseball, while other tokens that appear frequently throughout the corpus, like "run," "score," and "play," are less important.

TF–IDF, term frequency–inverse document frequency, encoding normalizes the frequency of tokens in a document with respect to the rest of the corpus. This encoding approach accentuates terms that are very relevant to a specific instance, as shown in Figure 4-4, where the token studio has a higher relevance to this document since it only appears there.

Term frequency–inverse document frequency encodes documents relative to their most unique and relevant terms.

Figure 4-4. TF–IDF encoding

TF–IDF is computed on a per-term basis, such that the relevance of a token to a document is measured by the scaled frequency of the appearance of the term in the document, normalized by the inverse of the scaled frequency of the term in the entire corpus.
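
In one common formulation (a sketch of the textbook definition; Scikit-Learn and Gensim each apply their own smoothing and normalization variants):

    tfidf(t, d) = tf(t, d) * idf(t),  where  idf(t) = log(N / df(t))

Here tf(t, d) is the frequency of term t in document d, N is the number of documents in the corpus, and df(t) is the number of documents that contain t. A term that appears in every document gets idf(t) = log(1) = 0 and is zeroed out, while rare terms receive a boost.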

With NLTK

To vectorize text in this way with NLTK, we use the TextCollection class, a wrapper for a list of texts or a corpus consisting of one or more texts. This class provides support for counting, concordancing, collocation discovery, and more importantly, computing tf_idf.

Because TF–IDF requires the entire corpus, our new version of vectorize does not accept a single document, but rather all documents. After applying our tokenization function and creating the text collection, the function goes through each document in the corpus and yields a dictionary whose keys are the terms and whose values are the TF–IDF score for the term in that particular document.

    from nltk.text import TextCollection

    def vectorize(corpus):
        corpus = [list(tokenize(doc)) for doc in corpus]
        texts = TextCollection(corpus)

        for doc in corpus:
            yield {
                term: texts.tf_idf(term, doc)
                for term in doc
            }

In Scikit-Learn

Scikit-Learn provides a transformer called the TfidfVectorizer in the module called feature_extraction.text for vectorizing documents with TF–IDF scores. Under the hood, the TfidfVectorizer uses the CountVectorizer estimator we used to produce the bag-of-words encoding to count occurrences of tokens, followed by a TfidfTransformer, which normalizes these occurrence counts by the inverse document frequency.

The input for a TfidfVectorizer is expected to be a sequence of filenames, file-like objects, or strings that contain a collection of raw documents, similar to that of the CountVectorizer. As a result, a default tokenization and preprocessing method is applied unless other functions are specified. The vectorizer returns a sparse matrix representation in the form of ((doc, term), tfidf) where each key is a document and term pair and the value is the TF–IDF score.

    from sklearn.feature_extraction.text import TfidfVectorizer

    tfidf = TfidfVectorizer()
    corpus = tfidf.fit_transform(corpus)
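
Printing a row of the resulting sparse matrix shows the ((doc, term), tfidf) form described above. A minimal sketch; the indices and scores shown are illustrative and will vary with the corpus:

    print(corpus[0])
    # (0, 4)    0.39...   <- ((doc, term), tfidf) entries for the first document
    # (0, 11)   0.57...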

The Gensim way

In Gensim, the TfidfModel data structure is similar to the Dictionary object in that it stores a mapping of terms and their vector positions in the order they are observed, but additionally stores the corpus frequency of those terms so it can vectorize documents on demand. As before, Gensim allows us to apply our own tokenization method, expecting a corpus that is a list of lists of tokens. We first construct the lexicon and use it to instantiate the TfidfModel, which computes the normalized inverse document frequency. We can then fetch the TF–IDF representation for each vector using a getitem, dictionary-like syntax, after applying the doc2bow method to each document using the lexicon.

    corpus = [list(tokenize(doc)) for doc in corpus]
    lexicon = gensim.corpora.Dictionary(corpus)
    tfidf = gensim.models.TfidfModel(dictionary=lexicon, normalize=True)
    vectors = [tfidf[lexicon.doc2bow(doc)] for doc in corpus]

Gensim provides helper functionality to write dictionaries and models to disk in a compact format, meaning you can conveniently save both the TF–IDF model and the lexicon to disk in order to load them later to vectorize new documents. It is possible (though slightly more work) to achieve the same effect by using the pickle module in combination with Scikit-Learn. To save a Gensim model to disk:

    lexicon.save_as_text('lexicon.txt', sort_by_word=True)
    tfidf.save('tfidf.pkl')

This will save the lexicon as a text-delimited file, sorted lexicographically, and the TF–IDF model as a pickled sparse matrix. Note that the Dictionary object can also be saved more compactly in a binary format using its save method, but save_as_text allows easy inspection of the lexicon for later work. To load the models from disk:

    lexicon = gensim.corpora.Dictionary.load_from_text('lexicon.txt')
    tfidf = gensim.models.TfidfModel.load('tfidf.pkl')
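
With the lexicon and model restored, a previously unseen document can be vectorized without refitting. A minimal sketch using the tokenize helper from earlier in the chapter:

    newdoc = "The bat sneezed at the elephant."
    vector = tfidf[lexicon.doc2bow(list(tokenize(newdoc)))]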

One benefit of TF–IDF is that it naturally addresses the problem of stopwords, those words most likely to appear in all documents in the corpus (e.g., "a," "the," "of," etc.), which will thus accrue very small weights under this encoding scheme. This biases the TF–IDF model toward moderately rare words. As a result, TF–IDF is widely used for bag-of-words models, and is an excellent starting point for most text analytics.

Distributed Representation

While frequency, one-hot, and TF–IDF encoding enable us to put documents into vector space, it is often useful to also encode the similarities between documents in the context of that same vector space. Unfortunately, these vectorization methods produce document vectors with non-negative elements, which means we won't be able to compare documents that don't share terms (because two vectors with a cosine distance of 1 will be considered far apart, even if they are semantically similar).

When document similarity is important in the context of an application, we instead encode text along a continuous scale with a distributed representation, as shown in Figure 4-5. This means that the resulting document vector is not a simple mapping from token position to token score. Instead, the document is represented in a feature space that has been embedded to represent word similarity. The complexity of this space (and the resulting vector length) is the product of how that representation is trained, and is not directly tied to the document itself.

A distributed representation allots weight continuously along a vector to encode information about a word.

Figure 4-5. Distributed representation

Word2vec, created by a team of researchers at Google led by Tomáš Mikolov, implements a word embedding model that enables us to create these kinds of distributed representations. The word2vec algorithm trains word representations based on either a continuous bag-of-words (CBOW) or skip-gram model, such that words are embedded in space along with similar words based on their context. For instance, Gensim's implementation uses a feedforward network.

The doc2vec algorithm is an extension of word2vec. It proposes a paragraph vector—an unsupervised algorithm that learns fixed-length feature representations from variable-length documents. This representation attempts to inherit the semantic properties of words such that "red" and "colorful" are more similar to each other than they are to "river" or "governance." Moreover, the paragraph vector takes into consideration the ordering of words within a narrow context, similar to an n-gram model. The combined result is much more effective than a bag-of-words or bag-of-n-grams model because it generalizes better and has a lower dimensionality but still is of a fixed length so it can be used in common machine learning algorithms.

The Gensim way

Neither NLTK nor Scikit-Learn provide implementations of these kinds of word embeddings. Gensim's implementation allows users to train both word2vec and doc2vec models on custom corpora and also conveniently comes with a model that is pretrained on the Google News corpus.

Note

To use Gensim's pretrained models, you'll need to download the model bin file, which clocks in at 1.5 GB. For applications that require extremely lightweight dependencies (e.g., if they have to run on an AWS Lambda instance), this may not be practicable.

We can train our own model as follows. First, we use a list comprehension to load our corpus into memory. (Gensim supports streaming, but this will enable us to avoid exhausting the generator.) Next, we create a list of TaggedDocument objects, which extend the LabeledSentence, and in turn the distributed representation of word2vec. TaggedDocument objects consist of words and tags. We can instantiate the tagged document with the list of tokens along with a single tag, one that uniquely identifies the instance. In this case, we've labeled each document as "d{}".format(idx), e.g., d0, d1, d2 and so forth.

Once we have a list of tagged documents, we instantiate the Doc2Vec model and specify the size of the vector as well as the minimum count, which ignores all tokens that have a frequency less than that number. The size parameter is usually not as low a dimensionality as 5; we selected such a small number for demonstration purposes only. We also set the min_count parameter to zero to ensure we consider all tokens, but generally this is set between 3 and 5, depending on how much information the model needs to capture. Once instantiated, an unsupervised neural network is trained to learn the vector representations, which can then be accessed via the docvecs property.

    from gensim.models.doc2vec import TaggedDocument, Doc2Vec

    corpus = [list(tokenize(doc)) for doc in corpus]
    corpus = [
        TaggedDocument(words, ['d{}'.format(idx)])
        for idx, words in enumerate(corpus)
    ]

    model = Doc2Vec(corpus, size=5, min_count=0)
    print(model.docvecs[0])
    # [ 0.01797447 -0.01509272  0.0731937   0.06814702 -0.0846546 ]
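
The trained model can also infer vectors for documents it has never seen, which is how new instances can be handed to downstream models. A minimal sketch (note that newer Gensim releases rename size to vector_size and docvecs to dv):

    # Output values will vary from run to run
    newdoc = list(tokenize("The studio door opened."))
    print(model.infer_vector(newdoc))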

Distributed representations will dramatically improve results over TF–IDF models when used correctly. The model itself can be saved to disk and retrained in an active fashion, making it extremely flexible for a variety of use cases. However, on larger corpora, training can be slow and memory intensive, and it might not be as good as a TF–IDF model with Principal Component Analysis (PCA) or Singular Value Decomposition (SVD) applied to reduce the feature space. In the end, however, this representation is breakthrough work that has led to a dramatic improvement in the text processing capabilities of data products in recent years.

Once again, the choice of vectorization technique (as well as the library implementation) tends to be use case- and application-specific, as summarized in Table 4-1.

Table 4-1. Overview of text vectorization methods

| Vectorization Method        | Function                                            | Good For                             | Considerations                                                                         |
|-----------------------------|-----------------------------------------------------|--------------------------------------|----------------------------------------------------------------------------------------|
| Frequency                   | Counts term frequencies                             | Bayesian models                      | Most frequent words not always most informative                                        |
| One-Hot Encoding            | Binarizes term occurrence (0, 1)                    | Neural networks                      | All words equidistant, so normalization extra important                                |
| TF–IDF                      | Normalizes term frequencies across documents        | General purpose                      | Moderately frequent terms may not be representative of document topics                 |
| Distributed Representations | Context-based, continuous term similarity encoding  | Modeling more complex relationships  | Performance intensive; difficult to scale without additional tools (e.g., Tensorflow)  |

Later in this chapter we will explore the Scikit-Learn Pipeline object, which enables us to streamline vectorization together with later modeling phases. As such, we often prefer to use vectorizers that conform to the Scikit-Learn API. In the next section, we will discuss how the API is organized and demonstrate how to integrate vectorization into a complete pipeline to construct the core of a fully operational (and customizable!) textual machine learning application.

The Scikit-Learn API

Scikit-Learn is an extension of SciPy (a scikit) whose primary purpose is to provide machine learning algorithms as well as the tools and utilities required to engage in successful modeling. Its primary contribution is an "API for machine learning" that exposes the implementations of a wide array of model families through a single, user-friendly interface. The result is that Scikit-Learn can be used to simultaneously train a staggering variety of models, evaluate and compare them, and then use the fitted model to make predictions on new data. Because Scikit-Learn provides a standardized API, this can be done with little effort and models can be prototyped and evaluated by simply swapping out a few lines of code.

The BaseEstimator Interface

The API itself is object-oriented and describes a hierarchy of interfaces for different machine learning tasks. The root of the hierarchy is an Estimator, broadly any object that can learn from data. The primary Estimator objects implement classifiers, regressors, or clustering algorithms. However, they can also include a wide array of data manipulation, from dimensionality reduction to feature extraction from raw data. The Estimator essentially serves as an interface, and classes that implement Estimator functionality must have two methods—fit and predict—as shown here:

    from sklearn.base import BaseEstimator

    class Estimator(BaseEstimator):

        def fit(self, X, y=None):
            """
            Accept input data, X, and optional target data, y. Returns self.
            """
            return self

        def predict(self, X):
            """
            Accept input data, X, and return a vector of predictions for each row.
            """
            return yhat

The Estimator.fit method sets the state of the estimator based on the training data, X and y. The training data X is expected to be matrix-like—for example, a two-dimensional NumPy array of shape (n_samples, n_features) or a Pandas DataFrame whose rows are the instances and whose columns are the features. Supervised estimators are also fit with a one-dimensional NumPy array, y, that holds the correct labels. The fitting process modifies the internal state of the estimator such that it is ready or able to make predictions. This state is stored in instance variables that are usually postfixed with an underscore (e.g., Estimator.coefs_). Because this method modifies an internal state, it returns self so the method can be chained.

The Estimator.predict method creates predictions using the internal, fitted state of the model on the new data, X. The input for the method must have the same number of columns as the training data passed to fit, and can have as many rows as predictions are required. This method returns a vector, yhat, which contains the predictions for each row in the input data.

Note

Extending Scikit-Learn's BaseEstimator automatically gives the Estimator a fit_predict method, which allows you to combine fit and predict in one simple call.

Estimator objects have parameters (also called hyperparameters) that define how the fitting process is conducted. These parameters are set when the Estimator is instantiated (and if not specified, they are set to reasonable defaults), and can be modified with the get_params and set_params methods that are also available from the BaseEstimator superclass.

We engage the Scikit-Learn API by specifying the package and type of the estimator. Here we select the Naive Bayes model family, and a specific member of the family, a multinomial model (which is suitable for text classification). The model is defined when the class is instantiated and hyperparameters are passed in. Here we pass an alpha parameter that is used for additive smoothing, as well as prior probabilities for each of our two classes. The model is trained on specific data (documents and labels) and at that point becomes a fitted model. This basic usage is the same for every model (Estimator) in Scikit-Learn, from random forest decision tree ensembles to logistic regressions and beyond.

    from sklearn.naive_bayes import MultinomialNB

    model = MultinomialNB(alpha=0.0, class_prior=[0.4, 0.6])
    model.fit(documents, labels)
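
Once fit, the same interface produces predictions and exposes the hyperparameters. A minimal sketch, in which new_documents is a hypothetical placeholder for held-out, already vectorized data:

    yhat = model.predict(new_documents)  # one predicted label per input row
    print(model.get_params())            # e.g., {'alpha': 0.0, 'class_prior': [0.4, 0.6], ...}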

Extending TransformerMixin

Scikit-Learn also specifies utilities for performing machine learning in a repeatable fashion. We could not discuss Scikit-Learn without also discussing the Transformer interface. A Transformer is a special type of Estimator that creates a new dataset from an old one based on rules that it has learned from the fitting process. The interface is as follows:

    from sklearn.base import TransformerMixin

    class Transformer(BaseEstimator, TransformerMixin):

        def fit(self, X, y=None):
            """
            Learn how to transform data based on input data, X.
            """
            return self

        def transform(self, X):
            """
            Transform X into a new dataset, Xprime, and return it.
            """
            return Xprime

The Transformer.transform method takes a dataset and returns a new dataset, Xprime, with new values based on the transformation process. There are several transformers included in Scikit-Learn, including transformers to normalize or scale features, handle missing values (imputation), perform dimensionality reduction, extract or select features, or perform mappings from one feature space to another.

Although NLTK, Gensim, and even newer text analytics libraries like SpaCy have their own internal APIs and learning mechanisms, the scope and comprehensiveness of Scikit-Learn models and methodologies for machine learning make it an essential part of the modeling workflow. As a result, we recommend using the API to create our own Transformer and Estimator objects that implement methods from NLTK and Gensim. For instance, we can create topic modeling estimators that wrap Gensim's LDA and LSA models (which are not currently included in Scikit-Learn) or create transformers that utilize NLTK's part-of-speech tagging and named entity chunking methods.

Creating a custom Gensim vectorization transformer

Gensim vectorization techniques are an interesting case study because Gensim corpora can be saved and loaded from disk in such a way as to remain decoupled from the pipeline. However, it is possible to build a custom transformer that uses Gensim vectorization. Our GensimVectorizer transformer will wrap a Gensim Dictionary object generated during fit() and whose doc2bow method is used during transform(). The Dictionary object (like the TfidfModel) can be saved and loaded from disk, so our transformer utilizes that methodology by taking a path on instantiation. If a file exists at that path, it is loaded immediately. Additionally, a save() method allows us to write our Dictionary to disk, which we can do in fit().

The fit() method constructs the Dictionary object by passing already tokenized and normalized documents to the Dictionary constructor. The Dictionary is then immediately saved to disk so that the transformer can be loaded without requiring a refit. The transform() method uses the Dictionary.doc2bow method, which returns a sparse representation of the document as a list of (token_id, frequency) tuples. This representation can present challenges with Scikit-Learn, however, so we use a Gensim helper function, sparse2full, to convert the sparse representation into a NumPy array.

    import os

    from gensim.corpora import Dictionary
    from gensim.matutils import sparse2full

    class GensimVectorizer(BaseEstimator, TransformerMixin):

        def __init__(self, path=None):
            self.path = path
            self.id2word = None
            self.load()

        def load(self):
            if os.path.exists(self.path):
                self.id2word = Dictionary.load(self.path)

        def save(self):
            self.id2word.save(self.path)

        def fit(self, documents, labels=None):
            self.id2word = Dictionary(documents)
            self.save()
            return self

        def transform(self, documents):
            for document in documents:
                docvec = self.id2word.doc2bow(document)
                yield sparse2full(docvec, len(self.id2word))
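
Here is a minimal usage sketch, assuming corpus is a list of already tokenized documents and 'lexicon.pkl' is a hypothetical path we are free to write to:

    vect = GensimVectorizer(path='lexicon.pkl')
    vect.fit(corpus)                        # builds and saves the Dictionary
    vectors = list(vect.transform(corpus))  # dense NumPy arrays, one per document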

It is easy to see how the vectorization methodologies that we discussed earlier in the chapter can be wrapped by Scikit-Learn transformers. This gives us more flexibility in the approaches we take, while still allowing us to leverage the machine learning utilities in each library. We will leave it to the reader to extend this example and investigate TF–IDF and distributed representation transformers that are implemented in the same fashion.

Creating a custom text normalization transformer

Many model families suffer from "the curse of dimensionality"; as the feature space increases in dimensions, the data becomes more sparse and less informative to the underlying decision space. Text normalization reduces the number of dimensions, decreasing sparsity. Besides the simple filtering of tokens (removing punctuation and stopwords), there are two primary methods for text normalization: stemming and lemmatization.

Stemming uses a series of rules (or a model) to slice a string down to a smaller substring. The goal is to remove word affixes (particularly suffixes) that modify meaning. For example, removing an 's' or 'es', which generally indicates plurality in Latin languages. Lemmatization, on the other hand, uses a dictionary to look up every token and returns the canonical "head" word in the dictionary, called a lemma. Because it is looking up tokens from a ground truth, it can handle irregular cases as well as tokens with different parts of speech. For example, the verb 'gardening' should be lemmatized to 'to garden', while the nouns 'garden' and 'gardener' are both different lemmas. Stemming would capture all of these tokens into a single 'garden' token.

Stemming and lemmatization have their advantages and disadvantages. Because it only requires us to splice word strings, stemming is faster. Lemmatization, on the other hand, requires a lookup to a dictionary or database, and uses part-of-speech tags to identify a word's root lemma, making it noticeably slower than stemming, but also more effective.

To perform text normalization in a systematic fashion, we will write a custom transformer that puts these pieces together. Our TextNormalizer class takes as input a language that is used to load the correct stopwords from the NLTK corpus. We could also customize the TextNormalizer to allow users to choose between stemming and lemmatization, and pass the language into the SnowballStemmer. For filtering extraneous tokens, we create two methods. The first, is_punct(), checks if every character in the token has a Unicode category that starts with 'P' (for punctuation); the second, is_stopword(), determines if the token is in our set of stopwords.

    import nltk
    import unicodedata

    from nltk.corpus import wordnet as wn
    from nltk.stem import WordNetLemmatizer
    from sklearn.base import BaseEstimator, TransformerMixin

    class TextNormalizer(BaseEstimator, TransformerMixin):

        def __init__(self, language='english'):
            self.stopwords = set(nltk.corpus.stopwords.words(language))
            self.lemmatizer = WordNetLemmatizer()

        def is_punct(self, token):
            return all(
                unicodedata.category(char).startswith('P')
                for char in token
            )

        def is_stopword(self, token):
            return token.lower() in self.stopwords

We can then add a normalize() method that takes a single document composed of a list of paragraphs, which are lists of sentences, which are lists of (token, tag) tuples—the data format that we preprocessed raw HTML to in Chapter 3.

        def normalize(self, document):
            return [
                self.lemmatize(token, tag).lower()
                for paragraph in document
                for sentence in paragraph
                for (token, tag) in sentence
                if not self.is_punct(token) and not self.is_stopword(token)
            ]

This method applies the filtering functions to remove unwanted tokens and then lemmatizes them. The lemmatize() method first converts the Penn Treebank part-of-speech tags that are the default tag set in the nltk.pos_tag function to WordNet tags, selecting nouns by default.

        def lemmatize(self, token, pos_tag):
            tag = {
                'N': wn.NOUN,
                'V': wn.VERB,
                'R': wn.ADV,
                'J': wn.ADJ
            }.get(pos_tag[0], wn.NOUN)

            return self.lemmatizer.lemmatize(token, tag)

Finally, we must add the Transformer interface, allowing us to add this class to a Scikit-Learn pipeline, which we'll explore in the next section:

        def fit(self, X, y=None):
            return self

        def transform(self, documents):
            for document in documents:
                yield self.normalize(document)
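
To illustrate the expected input format, here is a minimal sketch with a hand-built (token, tag) document standing in for the Chapter 3 preprocessing output; it assumes the NLTK stopwords and WordNet data are installed:

    # One document = paragraphs -> sentences -> (token, tag) tuples
    doc = [[[('The', 'DT'), ('elephants', 'NNS'), ('sneezed', 'VBD'), ('.', '.')]]]
    print(list(TextNormalizer().transform([doc])))
    # [['elephant', 'sneeze']]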

Note that text normalization is only one methodology, and it also utilizes NLTK very heavily, which may add unnecessary overhead to your application. Other options could include removing tokens that appear above or below a particular count threshold or removing stopwords and then only selecting the first five to ten thousand most common words. Yet another option is simply computing the cumulative frequency and only selecting words that contain 10%–50% of the cumulative frequency distribution. These methods would allow us to ignore both the very low frequency hapaxes (terms that appear only once) and the most common words, enabling us to identify the most potentially predictive terms in the corpus.

Caution

The act of text normalization should be optional and applied carefully because the operation is destructive in that it removes information. Case, punctuation, stopwords, and varying word constructions are all critical to understanding language. Some models may require indicators such as case. A named entity recognition classifier, for example, relies on the fact that in English, proper nouns are capitalized.

An alternative approach is to perform dimensionality reduction with Principal Component Analysis (PCA) or Singular Value Decomposition (SVD), to reduce the feature space to a specific dimensionality (e.g., five or ten thousand dimensions) based on word frequency. These transformers would have to be applied following a vectorizer transformer, and would have the effect of merging together words that are similar into the same vector space.

Pipelines

The machine learning process often combines a series of transformers on raw data, transforming the dataset each step of the way until it is passed to the fit method of a final estimator. But if we don't vectorize our documents in the same exact manner, we will end up with incorrect or, at the very least, unintelligible results. The Scikit-Learn Pipeline object is the solution to this dilemma.

Pipeline objects enable us to integrate a series of transformers that combine normalization, vectorization, and feature analysis into a single, well-defined mechanism. As shown in Figure 4-6, Pipeline objects move data from a loader (an object that will wrap our CorpusReader from Chapter 2) into feature extraction mechanisms and finally to an estimator object that implements our predictive models. Pipelines are directed acyclic graphs (DAGs) that can range from simple linear chains of transformers to arbitrarily complex branching and joining paths.

Pipelines implement a DAG of data from data loading through feature extraction to a final estimator. Pipelines can be arbitrarily complex or simple linear structures.

Figure 4-6. Pipelines for text vectorization and feature extraction

Pipeline Basics

The purpose of a Pipeline is to chain together multiple estimators representing a fixed sequence of steps into a single unit. All estimators in the pipeline, except the last one, must be transformers (that is, they must implement the transform method), while the final estimator can be of any type, including predictive estimators. Pipelines provide convenience; fit and transform can be called for single inputs across multiple objects at once. Pipelines also provide a single interface for grid search of multiple estimators at once. Most importantly, pipelines provide operationalization of text models by coupling a vectorization methodology with a predictive model.

Pipelines are constructed by describing a list of (key, value) pairs where the key is a string that names the step and the value is the estimator object. Pipelines can be created either by using the make_pipeline helper function, which automatically determines the names of the steps, or by specifying them directly. Generally, it is better to specify the steps directly to provide good user documentation, whereas make_pipeline is used more often for automatic pipeline construction.
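As a quick sketch of the difference, the two constructions below produce equivalent pipelines; make_pipeline derives its step names from the lowercased class names:

    from sklearn.pipeline import Pipeline, make_pipeline
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    # Explicit names document each step and keep grid search parameters readable.
    explicit = Pipeline([
        ('vect', CountVectorizer()),
        ('bayes', MultinomialNB()),
    ])

    # Automatic names: 'countvectorizer' and 'multinomialnb'.
    automatic = make_pipeline(CountVectorizer(), MultinomialNB())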

Pipeline objects are a Scikit-Learn specific utility, but they are also the critical integration point with NLTK and Gensim. Here is an example that joins the TextNormalizer and GensimVectorizer we created in the last section together in advance of a Bayesian model. By using the Transformer API as discussed earlier in the chapter, we can use TextNormalizer to wrap NLTK CorpusReader objects and perform preprocessing and linguistic feature extraction. Our GensimVectorizer is responsible for vectorization, and Scikit-Learn is responsible for the integration via Pipelines, utilities like cross-validation, and the many models we will use, from Naive Bayes to Logistic Regression.

    from sklearn.pipeline import Pipeline
    from sklearn.naive_bayes import MultinomialNB

    model = Pipeline([
        ('normalizer', TextNormalizer()),
        ('vectorizer', GensimVectorizer()),
        ('bayes', MultinomialNB()),
    ])

The Pipeline can then be used as a single instance of a complete model. Calling model.fit is the same as calling fit on each estimator in sequence, transforming the input and passing it on to the next step. Other methods like fit_transform behave similarly. The pipeline will also have all the methods the final estimator in the pipeline has. If the final estimator is a transformer, so too is the pipeline. If the final estimator is a classifier, as in the example above, then the pipeline will also have predict and score methods so that the entire model can be used as a classifier.

The estimators in the pipeline are stored as a list, and can be accessed by index. For example, model.steps[1] returns the tuple ('vectorizer', GensimVectorizer(path=None)). However, common usage is to identify estimators by their names using the named_steps dictionary property of the Pipeline object. The easiest way to access the predictive model is to use model.named_steps["bayes"] and fetch the estimator directly.
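Both access patterns look like this in practice, using the model defined above:

    # Fetch estimators by position or by name from the pipeline.
    name, vectorizer = model.steps[1]       # ('vectorizer', GensimVectorizer(...))
    bayes = model.named_steps["bayes"]      # the MultinomialNB estimator itself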

Grid Search for Hyperparameter Optimization

In Chapter 5, we will talk more about model tuning and iteration, but for now we'll simply introduce an extension of the Pipeline, GridSearch, which is useful for hyperparameter optimization. Grid search can be used to vary the parameters of all estimators in the Pipeline as though it were a single object. In order to access the attributes of estimators, you would use the set_params or get_params pipeline methods with a dunderscore representation of the estimator and parameter names as follows: estimator__parameter.

Let's say that we want to one-hot encode only the terms that appear at least three times in the corpus; we could modify the Binarizer as follows:

    model.set_params(onehot__threshold=3.0)

Using this principle, we could execute a grid search by defining the search parameter grid using the dunderscore parameter syntax. Consider the following grid search to determine the best one-hot encoded Bayesian text classification model:

    from sklearn.model_selection import GridSearchCV

    search = GridSearchCV(model, param_grid={
        'count__analyzer': ['word', 'char', 'char_wb'],
        'count__ngram_range': [(1, 1), (1, 2), (1, 3), (1, 4), (1, 5), (2, 3)],
        'onehot__threshold': [0.0, 1.0, 2.0, 3.0],
        'bayes__alpha': [0.0, 1.0],
    })

The search nominates three possibilities for the CountVectorizer analyzer parameter (creating n-grams on word boundaries, character boundaries, or only on characters that are between word boundaries), and several possibilities for the n-gram ranges to tokenize against. We also specify the threshold for binarization, meaning that the n-gram has to appear a certain number of times before it's included in the model. Finally, the search specifies two smoothing parameters (the bayes__alpha parameter): either no smoothing (add 0.0) or Laplacian smoothing (add 1.0).

The grid search will instantiate a pipeline of our model for each combination of features, then use cross-validation to score the model and select the best combination of features (in this case, the combination that maximizes the F1 score).
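Running the search follows the usual estimator API. Here is a brief sketch, assuming hypothetical documents and labels variables that hold our corpus and its targets (and noting that selecting on F1 requires passing scoring='f1' to GridSearchCV, since a classifier's default score is accuracy):

    # Fit every parameter combination with cross-validation, then inspect the winner.
    search.fit(documents, labels)

    print(search.best_params_)           # the winning parameter combination
    print(search.best_score_)            # its cross-validated score
    best_model = search.best_estimator_  # a pipeline refit on the full dataset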

Enriching Feature Extraction with Feature Unions

Pipelines do not have to be simple linear sequences of steps; in fact, they can be arbitrarily complex through the implementation of feature unions. The FeatureUnion object combines several transformer objects into a new, single transformer similar to the Pipeline object. However, instead of fitting and transforming data in sequence through each transformer, the transformers are evaluated independently and their results are concatenated into a composite vector.

Consider the example shown in Figure 4-7. We might imagine an HTML parser transformer that uses BeautifulSoup or an XML library to parse the HTML and return the body of each document. We then perform a feature engineering step, where entities and keyphrases are each extracted from the documents and the results passed into the feature union. Using frequency encoding on the entities is more sensible since they are relatively small, but TF–IDF makes more sense for the keyphrases. The feature union then concatenates the two resulting vectors such that our decision space ahead of the logistic regression separates word dimensions in the title from word dimensions in the body.

Feature unions allow arbitrarily complex pipelines by implementing transformer methods in parallel, concatenating the resulting vectors as final output.

Figure 4-7. Feature unions for branching vectorization

FeatureUnion objects are instantiated similarly to Pipeline objects, with a list of (key, value) pairs where the key is the name of the transformer and the value is the transformer object. There is also a make_union helper function that can automatically determine names and is used in a similar fashion to the make_pipeline helper function, for automatic or generated pipelines. Estimator parameters can also be accessed in the same fashion, and to implement a search on a feature union, simply nest the dunderscore for each transformer in the feature union.
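A quick sketch of make_union, along with the nested dunderscore naming used to reach a parameter inside a union (the names in the comment refer to the model constructed below):

    from sklearn.pipeline import make_union
    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

    # make_union derives step names from class names, just as make_pipeline does.
    union = make_union(CountVectorizer(), TfidfVectorizer())

    # In a grid search over the model below, a nested parameter chains
    # dunderscores through each level, e.g.:
    #   'text_union__entity_feature__entity_vect__max_features'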

Given the unimplemented EntityExtractor and KeyphraseExtractor transformers mentioned above, we can construct our pipeline as follows:

    from sklearn.pipeline import FeatureUnion
    from sklearn.linear_model import LogisticRegression

    model = Pipeline([
        ('parser', HTMLParser()),
        ('text_union', FeatureUnion(
            transformer_list=[
                ('entity_feature', Pipeline([
                    ('entity_extractor', EntityExtractor()),
                    ('entity_vect', CountVectorizer()),
                ])),
                ('keyphrase_feature', Pipeline([
                    ('keyphrase_extractor', KeyphraseExtractor()),
                    ('keyphrase_vect', TfidfVectorizer()),
                ])),
            ],
            transformer_weights={
                'entity_feature': 0.6,
                'keyphrase_feature': 0.2,
            }
        )),
        ('clf', LogisticRegression()),
    ])

Note that the HTMLParser, EntityExtractor, and KeyphraseExtractor objects are currently unimplemented but are used for illustration. The feature union is fit in sequence with respect to the rest of the pipeline, but each transformer within the feature union is fit independently, meaning that each transformer sees the same data as the input to the feature union. During transformation, each transformer is applied in parallel and the vectors that they output are concatenated together into a single larger vector, which can be optionally weighted, as shown in Figure 4-8.

In this example, we see the process of extracting entities and keyphrases from the original documents, and then joining them in a feature union ahead of vectorization and modeling.

Figure 4-8. Feature extraction and union

In this example, we are weighting the entity_feature transformer more than the keyphrase_feature transformer. Using combinations of custom transformers, feature unions, and pipelines, it is possible to define incredibly rich feature extraction and transformation in a repeatable way. By collecting our methodology into a single sequence, we can repeatably apply the transformations, especially on new documents when we want to make predictions in a production environment.
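Because the EntityExtractor and KeyphraseExtractor remain unimplemented, the skeleton below is only a sketch of the shape such a transformer must take to participate in the union; the extraction logic itself is a placeholder:

    from sklearn.base import BaseEstimator, TransformerMixin

    class EntityExtractor(BaseEstimator, TransformerMixin):
        """Placeholder sketch: reduce each document to a string of its entities."""

        def fit(self, documents, y=None):
            return self

        def transform(self, documents):
            for document in documents:
                # A real implementation would perform named entity recognition
                # here; as a stand-in, we yield the document text unchanged.
                yield document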

Conclusion

In this chapter, we conducted a whirlwind overview of vectorization techniques and began to consider their use cases for different kinds of data and different machine learning algorithms. In practice, it is best to select an encoding scheme based on the problem at hand; certain methods substantially outperform others for certain tasks.

For instance, for recurrent neural network models it is often better to use one-hot encoding, but to split up the text space one might create a combined vector for the document summary, document header, body, etc. Frequency encoding should be normalized, but different types of frequency encoding can benefit probabilistic methods like Bayesian models. TF–IDF is an excellent general-purpose encoding and is often used first in modeling, but can also cover a lot of sins. Distributed representations are the new hotness, but are performance intensive and difficult to scale.

Bag-of-words models have a very high dimensionality, meaning the space is extremely sparse, leading to difficulty generalizing the data space. Word order, grammar, and other structural features are natively lost, and it is difficult to add knowledge (e.g., lexical resources, ontological encodings) to the learning process. Local encodings (e.g., nondistributed representations) require a lot of samples, which could lead to overtraining or underfitting, but distributed representations are complex and add a layer of "representational mysticism."

Ultimately, much of the work for language-aware applications comes from domain-specific feature analysis, not just simple vectorization. In the final section of this chapter we explored the use of FeatureUnion and Pipeline objects to create meaningful extraction methodologies by combining transformers. As we move forward, the practice of building pipelines of transformers and estimators will continue to be our primary mechanism of performing machine learning. In Chapter 5 we will explore classification models and applications, then in Chapter 6 we will take a look at clustering models, often called topic modeling in text analysis. In Chapter 7, we will explore some more complex methods for feature analysis and feature exploration that will assist in fine-tuning our vector-based models to achieve better results. Nevertheless, simple models that only consider word frequencies are often very successful. In our experience, a pure bag-of-words model works about 85% of the time!

