3.6 Text Normalization 3.7 Regular Expressions for Tokenizing Text

3.6 Normalizing Text

In earlier program examples we have often converted text to lowercase before doing anything with its words, for example:

set(w.lower() for w in text)

By using lower(), we normalized the text to lowercase so that the distinction between The and the is ignored. Often we want to go further than this and strip off any affixes, a task known as **stemming**. A further step is to make sure that the resulting form is a known word in a dictionary, a task known as **lemmatization**. We discuss each of these in turn. First, we need to define the data we will use in this section.

>>> raw = """DENNIS: Listen, strange women lying in ponds distributing swords
... is no basis for a system of government.  Supreme executive power derives from
... a mandate from the masses, not from some farcical aquatic ceremony."""
>>> tokens = word_tokenize(raw)
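
These interactive snippets assume that NLTK is installed and that the standard imports have been made; a minimal sketch of the setup they rely on (word_tokenize may also require downloading the punkt tokenizer models once, e.g. with nltk.download('punkt')):

>>> import re
>>> import nltk
>>> from nltk import word_tokenize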

Stemmers

NLTK contains several off-the-shelf stemmers.

A stemmer performs stemming: search engines, for example, identify the stem of a search term so that searches also match its inflected and derived forms.

If you ever need a stemmer, you should use one of these in preference to crafting your own with regular expressions. The Porter and Lancaster stemmers each follow their own rules for stripping affixes.

• The Porter stemmer removes common morphological and inflectional endings from English words; it is the most widely used stemmer and a good default choice.
• The Lancaster stemmer is the fastest, but its aggressive stemming can produce problematic results.

Observe below that the Porter stemmer correctly handles the word lying (mapping it to lie), while the Lancaster stemmer does not.

>>> porter = nltk.PorterStemmer()
>>> lancaster = nltk.LancasterStemmer()
>>> [porter.stem(t) for t in tokens]
['denni', ':', 'listen', ',', 'strang', 'women', 'lie', 'in', 'pond',
'distribut', 'sword', 'is', 'no', 'basi', 'for', 'a', 'system', 'of', 'govern',
'.', 'suprem', 'execut', 'power', 'deriv', 'from', 'a', 'mandat', 'from',
'the', 'mass', ',', 'not', 'from', 'some', 'farcic', 'aquat', 'ceremoni', '.']
>>> [lancaster.stem(t) for t in tokens]
['den', ':', 'list', ',', 'strange', 'wom', 'lying', 'in', 'pond', 'distribut',
'sword', 'is', 'no', 'bas', 'for', 'a', 'system', 'of', 'govern', '.', 'suprem',
'execut', 'pow', 'der', 'from', 'a', 'mand', 'from', 'the', 'mass', ',', 'not',
'from', 'som', 'farc', 'aqu', 'ceremony', '.']
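
To see that difference on a single word, a quick check using the porter and lancaster objects defined above:

>>> porter.stem('lying')
'lie'
>>> lancaster.stem('lying')
'lying'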

Stemming is not a well-defined process, and we typically pick the stemmer that best suits the application we have in mind. The Porter stemmer is a good choice if you are indexing some texts and want to support search using alternative forms of words, as illustrated in Figure 3.6.

class IndexedText(object):

    def __init__(self, stemmer, text):
        self._text = text
        self._stemmer = stemmer
        self._index = nltk.Index((self._stem(word), i)
                                 for (i, word) in enumerate(text))

    def concordance(self, word, width=40):
        key = self._stem(word)
        wc = int(width/4)                # words of context
        for i in self._index[key]:
            lcontext = ' '.join(self._text[i-wc:i])
            rcontext = ' '.join(self._text[i:i+wc])
            ldisplay = '{:>{width}}'.format(lcontext[-width:], width=width)
            rdisplay = '{:{width}}'.format(rcontext[:width], width=width)
            print(ldisplay, rdisplay)

    def _stem(self, word):
        return self._stemmer.stem(word).lower()
>>> porter = nltk.PorterStemmer()
>>> grail = nltk.corpus.webtext.words('grail.txt')
>>> text = IndexedText(porter, grail)
>>> text.concordance('lie')
r king ! DENNIS : Listen , strange women lying in ponds distributing swords is no
 beat a very brave retreat . ROBIN : All lies ! MINSTREL : [ singing ] Bravest of
       Nay . Nay . Come . Come . You may lie here . Oh , but you are wounded !
doctors immediately ! No , no , please ! Lie down . [ clap clap ] PIGLET : Well
ere is much danger , for beyond the cave lies the Gorge of Eternal Peril , which
   you . Oh ... TIM : To the north there lies a cave -- the cave of Caerbannog --
h it and lived ! Bones of full fifty men lie strewn about its lair . So , brave k
not stop our fight ' til each one of you lies dead , and the Holy Grail returns t

Figure 3.6: Indexing text using stemmers

Lemmatization

The WordNet lemmatizer removes affixes only if the resulting word is in its dictionary.

WordNet is an English lexical database (a semantic dictionary), and a lemmatizer converts a word to its headword (lemma). For example, with meet vs. meeting: in "I'll attend a meeting" the token maps to meeting, while in "I met him last night" it maps to meet.

This additional checking process makes the lemmatizer slower than the stemmers above. Notice that it doesn't handle lying, but it converts women to woman.

>>> wnl = nltk.WordNetLemmatizer()
>>> [wnl.lemmatize(t) for t in tokens]
['DENNIS', ':', 'Listen', ',', 'strange', 'woman', 'lying', 'in', 'pond',
'distributing', 'sword', 'is', 'no', 'basis', 'for', 'a', 'system', 'of',
'government', '.', 'Supreme', 'executive', 'power', 'derives', 'from', 'a',
'mandate', 'from', 'the', 'mass', ',', 'not', 'from', 'some', 'farcical',
'aquatic', 'ceremony', '.']
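
By default lemmatize() treats its argument as a noun, which is why lying and derives pass through unchanged above. If you know the part of speech, you can pass it via the pos parameter; a small sketch:

>>> wnl.lemmatize('lying', pos='v')
'lie'
>>> wnl.lemmatize('women')
'woman'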

The WordNet lemmatizer is a good choice if you want to compile the vocabulary of some texts and need a list of valid lemmas.

A lemma is the headword or dictionary form of a word.

3.7 Regular expressions for tokenizing text

**Tokenization** is the task of cutting a string into identifiable linguistic units that constitute a piece of language data. Although it is a fundamental task, we have been able to delay it until now because many corpora are already **tokenized**, and because NLTK includes some tokenizers. Now that you are familiar with regular expressions, you can learn how to use them to **tokenize** text, and gain much more control over the process.

A token is a character or string of characters treated as the smallest unit of analysis when processing natural language.

A simple approach to tokenization

The very simplest method for **tokenizing** text is to split on whitespace. Consider the following text from Alice's Adventures in Wonderland:

>>> raw = """'When I'M a Duchess,' she said to herself, (not in a very hopeful tone
... though), 'I won't have any pepper in my kitchen AT ALL. Soup does very
... well without--Maybe it's always pepper that makes people hot-tempered,'..."""

We could split this raw text on whitespace using raw.split(). To do the same thing with a regular expression, it is not enough to match any space characters in the string [1], since this produces tokens that contain a \n newline character; instead, we need to match any number of spaces, tabs, or newlines [2]:

>>> re.split(r' ', raw) [1]
["'When", "I'M", 'a', "Duchess,'", 'she', 'said', 'to', 'herself,', '(not', 'in',
'a', 'very', 'hopeful', 'tone\nthough),', "'I", "won't", 'have', 'any', 'pepper',
'in', 'my', 'kitchen', 'AT', 'ALL.', 'Soup', 'does', 'very\nwell', 'without--Maybe',
"it's", 'always', 'pepper', 'that', 'makes', 'people', "hot-tempered,'..."]
>>> re.split(r'[ \t\n]+', raw) [2]
["'When", "I'M", 'a', "Duchess,'", 'she', 'said', 'to', 'herself,', '(not', 'in',
'a', 'very', 'hopeful', 'tone', 'though),', "'I", "won't", 'have', 'any', 'pepper',
'in', 'my', 'kitchen', 'AT', 'ALL.', 'Soup', 'does', 'very', 'well', 'without--Maybe',
"it's", 'always', 'pepper', 'that', 'makes', 'people', "hot-tempered,'..."]

The regular expression «[ \t\n]+» matches one or more spaces, tabs (\t), or newlines (\n). Other whitespace characters, such as carriage return and form feed, should really be included too.

A carriage return is a control character that moves the cursor back to the beginning of the line.

Instead, we can use the built-in abbreviation \s, which means any whitespace character. The second statement above can be rewritten as re.split(r'\s+', raw).

Important: Don't forget to prefix the regular expression with the letter r. This tells the Python interpreter to treat the string as a raw string, rather than processing the backslashed characters it contains.
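
As a quick check that the shorthand behaves the same way on this particular string:

>>> re.split(r'\s+', raw) == re.split(r'[ \t\n]+', raw)
True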

Splitting on whitespace gives us tokens like '(not' and 'herself,'. An alternative is to use the fact that Python provides a character class \w for word characters, equivalent to [a-zA-Z0-9_]. It also defines the complement of this class, \W, i.e. all characters other than letters, digits, or the underscore. We can use \W in a simple regular expression to split the input on anything other than a word character.

An underscore is the "_" character (also called an underbar).

>>> re.split(r'\W+', raw)
['', 'When', 'I', 'M', 'a', 'Duchess', 'she', 'said', 'to', 'herself', 'not', 'in',
'a', 'very', 'hopeful', 'tone', 'though', 'I', 'won', 't', 'have', 'any', 'pepper',
'in', 'my', 'kitchen', 'AT', 'ALL', 'Soup', 'does', 'very', 'well', 'without',
'Maybe', 'it', 's', 'always', 'pepper', 'that', 'makes', 'people', 'hot', 'tempered',
'']

Observe that this gives us empty strings at the start and the end. We get the same tokens, but without the empty strings, with re.findall(r'\w+', raw), using a pattern that matches the words instead of the spaces. Now that we are matching the words, we are in a position to extend the regular expression to cover a wider range of cases. The regular expression «\w+|\S\w*» will first try to match any sequence of word characters.

A sequence is a series of items arranged in a fixed order and processed in that order.

If no match is found, it will try to match any non-whitespace character (\S is the complement of \s) followed by further word characters. This means that punctuation is grouped with any following letters (e.g. 's), but sequences of two or more punctuation characters are separated.

>>> re.findall(r'\w+|\S\w*', raw)
["'When", 'I', "'M", 'a', 'Duchess', ',', "'", 'she', 'said', 'to', 'herself', ',',
'(not', 'in', 'a', 'very', 'hopeful', 'tone', 'though', ')', ',', "'I", 'won', "'t",
'have', 'any', 'pepper', 'in', 'my', 'kitchen', 'AT', 'ALL', '.', 'Soup', 'does',
'very', 'well', 'without', '-', '-Maybe', 'it', "'s", 'always', 'pepper', 'that',
'makes', 'people', 'hot', '-tempered', ',', "'", '.', '.', '.']

Let's generalize the \w+ in the above expression to permit word-internal hyphens and apostrophes: «\w+([-']\w+)*». This expression means \w+ followed by zero or more instances of [-']\w+; it would match hot-tempered. We also add a pattern to match quote characters so that these are kept separate from the text they enclose.

>>> print(re.findall(r"\w+(?:[-']\w+)*|'|[-.(]+|\S\w*", raw))
["'", 'When', "I'M", 'a', 'Duchess', ',', "'", 'she', 'said', 'to', 'herself', ',',
'(', 'not', 'in', 'a', 'very', 'hopeful', 'tone', 'though', ')', ',', "'", 'I',
"won't", 'have', 'any', 'pepper', 'in', 'my', 'kitchen', 'AT', 'ALL', '.', 'Soup',
'does', 'very', 'well', 'without', '--', 'Maybe', "it's", 'always', 'pepper',
'that', 'makes', 'people', 'hot-tempered', ',', "'", '...']

The above expression also included «[-.(]+», which causes the double hyphen, the ellipsis, and the open parenthesis to be tokenized separately. Table 3.4 lists the regular expression character class symbols we have seen in this section, together with some other useful symbols.

Table 3.4: Regular expression symbols

Symbol   Function
\b       Word boundary (zero width)
\d       Any decimal digit (equivalent to [0-9])
\D       Any non-digit character (equivalent to [^0-9])
\s       Any whitespace character (equivalent to [ \t\n\r\f\v])
\S       Any non-whitespace character (equivalent to [^ \t\n\r\f\v])
\w       Any alphanumeric character (equivalent to [a-zA-Z0-9_])
\W       Any non-alphanumeric character (equivalent to [^a-zA-Z0-9_])
\t       The tab character
\n       The newline character
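
A couple of these symbols in action on a made-up example string (a quick sketch):

>>> re.findall(r'\d+', 'costs $12.40 or 82%')
['12', '40', '82']
>>> re.findall(r'\bAT\b', 'AT ALL, ATLAS')
['AT']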

NLTK Regular Expression Tokenizer

The function nltk.regexp_tokenize() is similar to re.findall() (as we have been using it for tokenization). However, nltk.regexp_tokenize() is more efficient for this task, and avoids the need for special treatment of parentheses. For readability we break up the regular expression over several lines and add a comment about each line. The special "verbose flag" tells Python to strip out the embedded whitespace and comments.

>>> text = 'That U.S.A. poster-print costs $12.40...'
>>> pattern = r'''(?x)     # set flag to allow verbose regexps
...     (?:[A-Z]\.)+       # abbreviations, e.g. U.S.A.
...   | \w+(?:-\w+)*       # words with optional internal hyphens
...   | \$?\d+(?:\.\d+)?%? # currency and percentages, e.g. $12.40, 82%
...   | \.\.\.             # ellipsis
...   | [][.,;"'?():-_`]   # these are separate tokens; includes ], [
... '''
>>> nltk.regexp_tokenize(text, pattern)
['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']

When using the verbose flag, you can no longer use ' ' to match a space character; use \s instead. The regexp_tokenize() function has an optional gaps parameter. When set to True, the regular expression specifies the gaps between tokens, as with re.split().
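
For example, whitespace tokenization can be expressed as a gap pattern; a minimal sketch reusing the text variable from above:

>>> nltk.regexp_tokenize(text, r'\s+', gaps=True)
['That', 'U.S.A.', 'poster-print', 'costs', '$12.40...']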

Other issues with tokenization

Tokenization turns out to be a far more difficult task than you might have expected. No single solution works well across the board, so we must decide what counts as a token depending on the application domain.

When developing a tokenizer it helps to have access to raw text which has been manually tokenized, so that you can compare your tokenizer's output with high-quality (or "gold standard") tokens. The NLTK corpus collection includes a sample of Penn Treebank data, including the raw Wall Street Journal text (nltk.corpus.treebank_raw.raw()) and the tokenized version (nltk.corpus.treebank.words()).
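
A minimal sketch of getting at both resources (this assumes the treebank sample data has been downloaded; the slice shown is the opening of the first WSJ article in the sample):

>>> nltk.corpus.treebank.words()[:10]
['Pierre', 'Vinken', ',', '61', 'years', 'old', ',', 'will', 'join', 'the']
>>> word_tokenize('Pierre Vinken, 61 years old, will join the board.')[:10]
['Pierre', 'Vinken', ',', '61', 'years', 'old', ',', 'will', 'join', 'the']

Here the two agree; on trickier inputs (contractions, hyphens, currency) they may not, which is exactly what the gold standard lets you measure.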

A final issue for tokenization is the presence of contractions such as didn't. If we are analyzing the meaning of a sentence, it is probably more useful to normalize this form to two separate forms: did and n't (or not). We can do this with the help of a lookup table.

A lookup table is a data structure, such as an array or associative array, that replaces a computation with a simple lookup. For example, if you want the data associated with a particular item, you can store that data in a lookup table in advance and then retrieve the value for the item directly. Since the computation does not have to be repeated on every request, the load on the computer is reduced and processing becomes more efficient.
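
As a minimal sketch of such a table (the CONTRACTIONS dictionary and the expand() helper here are made up for illustration):

>>> CONTRACTIONS = {"didn't": ['did', "n't"], "won't": ['wo', "n't"], "it's": ['it', "'s"]}
>>> def expand(tokens):
...     # replace each token with its listed expansion, or keep it unchanged
...     return [part for t in tokens for part in CONTRACTIONS.get(t, [t])]
...
>>> expand(['I', "didn't", 'do', 'it'])
['I', 'did', "n't", 'do', 'it']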
