Natural Language Processing (NLP) applies machine learning algorithms to text and speech. For example, we can use it to build systems for speech recognition, document summarization, machine translation, spam detection, named entity recognition, question answering, autocomplete, predictive typing, and so on.

NLTK (Natural Language Toolkit) is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to many corpora and lexical resources. Also, it contains a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning. Best of all, NLTK is a free, open source, community-driven project.

In this article, we’ll cover the following topics, all of which can be implemented with NLTK: sentence tokenization, word tokenization, text lemmatization and stemming, stop words, regex, bag of words, and TF-IDF. For example, text lemmatization maps the word “cars” to “car” and “ate” to “eat”.

Sentence Tokenization

This step splits a paragraph into sentences. Sentence tokenization (also called sentence segmentation) is the problem of dividing a string of written language into its component sentences. The idea is simple: in English and some other languages, we can split sentences apart whenever we see certain punctuation marks.
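A minimal sketch with NLTK’s sent_tokenize (this assumes the punkt tokenizer models have been downloaded; the example text is arbitrary):

    import nltk
    nltk.download("punkt")  # one-time download of the sentence tokenizer models

    text = "Snowboarding was inspired by skateboarding, sledding, surfing and skiing. It is now a popular winter sport."
    for sentence in nltk.sent_tokenize(text):
        print(sentence)

Each sentence is printed on its own line.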

Word Tokenization

This step splits a sentence into words, giving an even finer granularity. Word tokenization (also called word segmentation) is the problem of dividing a string of written language into its component words. In English and many other languages that use some form of the Latin alphabet, whitespace is a good approximation of a word divider.
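A matching sketch with NLTK’s word_tokenize (again assuming the punkt models are available); note that punctuation comes out as separate tokens:

    import nltk

    sentence = "The development of snowboarding was inspired by skateboarding, sledding, surfing and skiing."
    print(nltk.word_tokenize(sentence))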

Text Lemmatization and Stemming

If we view these operations as a form of normalization, one benefit is efficiency: mapping many surface forms to a single canonical form reduces the number of distinct tokens a model has to handle. Both operations lean on the grammar (morphology) of the language. For grammatical reasons, documents can contain different forms of a word, such as drive, drives, driving. Also, sometimes we have related words with a similar meaning, such as nation, national, nationality.

Stemming and lemmatization are special cases of normalization. However, they are different from each other.

Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes. Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.
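A small comparison using NLTK’s PorterStemmer and WordNetLemmatizer (the WordNet data needs a one-time download; the outputs in the comments are what these NLTK classes typically return):

    import nltk
    from nltk.stem import PorterStemmer, WordNetLemmatizer

    nltk.download("wordnet")  # lexical database required by the lemmatizer

    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()

    print(stemmer.stem("drives"))                # 'drive'  - crude suffix chopping
    print(stemmer.stem("studies"))               # 'studi'  - not a dictionary word
    print(lemmatizer.lemmatize("cars"))          # 'car'    - default POS is noun
    print(lemmatizer.lemmatize("ate", pos="v"))  # 'eat'    - verb POS gives the dictionary form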

Stop Words

Stop words tend to carry noise rather than useful information, which is why we remove them. Stop words are words which are filtered out before or after processing of text. When applying machine learning to text, these words can add a lot of noise; that’s why we want to remove these irrelevant words.

A stop-word list can be thought of as a filter vocabulary that changes with the application. Stop words usually refer to the most common words in a language, such as “and”, “the”, “a”, but there is no single universal list of stop words. The list of stop words can change depending on your application.

When storing stop words, use a set rather than a list; the main reason is that membership checks in a set are much faster than in a list. You might wonder why we convert our list into a set. A set is an abstract data type that stores unique values without any particular order, and the search operation in a set is much faster than the search operation in a list. For a small number of words there is no big difference, but if you have a large number of words it is highly recommended to use the set type.
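A sketch of stop-word removal with NLTK’s stop-word list stored in a set (the stopwords corpus needs a one-time download; the sentence is arbitrary):

    import nltk
    from nltk.corpus import stopwords

    nltk.download("stopwords")  # one-time download of the stop-word lists

    stop_words = set(stopwords.words("english"))  # set gives fast membership tests
    sentence = "This is one of the oldest known board games in the world."
    words = nltk.word_tokenize(sentence)          # assumes punkt is available (see above)
    print([w for w in words if w.lower() not in stop_words])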

Regex

A regular expression (regex or regexp) is a sequence of characters that defines a search pattern. Let’s look at some basics.

. - match any character except newline
\w - match a word character
\d - match a digit
\s - match whitespace
\W - match a non-word character
\D - match a non-digit
\S - match non-whitespace
[abc] - match any of a, b, or c
[^abc] - match anything except a, b, or c
[a-g] - match a character between a and g

The following explains why the r"" prefix is used for regular expressions in Python: the backslash has a special meaning both in regex syntax and in Python string literals, and the two uses collide. In short, with the r"" prefix the pattern is written purely in regex syntax and reaches the regex engine unchanged.

Regular expressions use the backslash character ('\') to indicate special forms or to allow special characters to be used without invoking their special meaning. This collides with Python’s usage of the same character for the same purpose in string literals; for example, to match a literal backslash, one might have to write '\\\\' as the pattern string, because the regular expression must be \\, and each backslash must be expressed as \\ inside a regular Python string literal. The solution is to use Python’s raw string notation for regular expression patterns; backslashes are not handled in any special way in a string literal prefixed with 'r'. So r"\n" is a two-character string containing '\' and 'n', while "\n" is a one-character string containing a newline. Usually, patterns will be expressed in Python code using this raw string notation.

An example:

	import re
	sentence = "The development of snowboarding was inspired by skateboarding, sledding, surfing and skiing."
	pattern = r"[^\w]"
	print(re.sub(pattern, " ", sentence))

Bag of words

Machine learning algorithms cannot work with raw text directly; we need to convert the text into vectors of numbers. This is called feature extraction. The bag-of-words model is a popular and simple feature extraction technique used when we work with text. It describes the occurrence of each word within a document.

The defining characteristic of bag of words is that the order and structure of words are not captured. Any information about the order or structure of words is discarded; that’s why it’s called a bag of words. The model only tries to determine whether a known word occurs in a document, not where in the document it occurs.

The intuition is that similar documents have similar content. Also, from the content alone, we can learn something about the meaning of the document.

To use this model, we need to:

  • Design a vocabulary of known words (also called tokens)
  • Choose a measure of the presence of known words
The simplest measure is binary occurrence: mark 1 if the word appears in the document and 0 otherwise. This is the most basic bag-of-words scoring scheme.

The complexity of the bag-of-words model comes in deciding how to design the vocabulary of known words (tokens) and how to score the presence of known words.
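A hand-rolled sketch of the binary (occurrence) variant, just to make the two steps concrete; real projects would normally use a library vectorizer instead:

    # Toy corpus; vocabulary design = collect the unique lowercased words.
    documents = [
        "I like this movie",
        "I hate this movie",
        "this movie is great",
    ]
    vocabulary = sorted({word.lower() for doc in documents for word in doc.split()})

    # Scoring = 1 if a known word is present in the document, 0 otherwise.
    def vectorize(doc):
        tokens = {word.lower() for word in doc.split()}
        return [1 if word in tokens else 0 for word in vocabulary]

    print(vocabulary)
    for doc in documents:
        print(vectorize(doc), "<-", doc)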

A drawback of occurrence-based bag of words is sparsity: when the vocabulary is large, a document’s vector representation consists mostly of zeros.

In some cases we can have a huge amount of data, and in these cases the length of the vector that represents a document might be thousands or millions of elements. Furthermore, each document may contain only a few of the known words in the vocabulary, so the vector representations will have a lot of zeros. Vectors with many zeros are called sparse vectors; they require more memory and computational resources. We can decrease the number of known words when using a bag-of-words model to reduce the required memory and computational resources, using the text cleaning techniques we’ve already seen in this article before we create our bag-of-words model.

These techniques also reduce the vocabulary size:

  • Ignoring punctuation
  • Removing the stop words from our documents
  • Reducing the words to their base form (text lemmatization and stemming)
  • Fixing misspelled words

The n-gram idea is broadly useful: taking sequences of words into account increases the expressive power of the representation. An n-gram is a sequence of a number of items (words, letters, numbers, digits, etc.). In the context of text corpora, n-grams typically refer to sequences of words: a unigram is one word, a bigram is a sequence of two words, a trigram is a sequence of three words, and so on.
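For example, NLTK exposes an ngrams helper that slides a window over a token list (the sentence is arbitrary):

    from nltk import ngrams

    tokens = "the development of snowboarding was inspired by skateboarding".split()
    print(list(ngrams(tokens, 2)))  # bigrams: pairs of consecutive words
    print(list(ngrams(tokens, 3)))  # trigrams: triples of consecutive words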

There are three common ways to score the presence of words. We have already seen the simplest one, the binary approach (1 for presence, 0 for absence). Two additional scoring methods are: 2) Counts: count the number of times each word appears in a document. 3) Frequencies: calculate the frequency of each word in a document out of all the words in the document.
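A quick sketch of the count and frequency scores using only the standard library (the toy document is arbitrary):

    from collections import Counter

    tokens = "the cat sat on the mat and the cat slept".split()

    counts = Counter(tokens)                                       # 2) Counts
    frequencies = {w: c / len(tokens) for w, c in counts.items()}  # 3) Frequencies

    print(counts["the"])                 # 3 occurrences
    print(round(frequencies["the"], 2))  # 3 out of 10 tokens -> 0.3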

TF-IDF is best understood in contrast to plain frequency: keywords are not necessarily the most frequent words, and rarer or domain-specific words may carry more information. One problem with scoring word frequency is that the most frequent words in the document start to have the highest scores, yet these frequent words may not contain as much “informational gain” for the model as some rarer, domain-specific words. One approach to fix this problem is to penalize words that are frequent across all the documents. This approach is called TF-IDF.

The key point of TF-IDF is that it takes the whole corpus into account. TF-IDF, short for term frequency-inverse document frequency, is a statistical measure used to evaluate the importance of a word to a document in a collection or corpus.
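A minimal sketch using scikit-learn’s TfidfVectorizer (scikit-learn is not used elsewhere in this article and is assumed to be installed; the toy corpus is arbitrary):

    # Assumption: scikit-learn >= 1.0 for get_feature_names_out().
    from sklearn.feature_extraction.text import TfidfVectorizer

    documents = [
        "the weather is sunny today",
        "the weather is rainy today",
        "snowboarding season starts soon",
    ]

    vectorizer = TfidfVectorizer()
    tfidf = vectorizer.fit_transform(documents)  # sparse matrix: documents x vocabulary

    print(vectorizer.get_feature_names_out())    # the learned vocabulary
    print(tfidf.toarray().round(2))              # words shared by many documents get lower weights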

Reference: https://towardsdatascience.com/introduction-to-natural-language-processing-for-text-df845750fb63