Tokenization (2024)

Given a character sequence and a defined document unit, tokenization is the task of chopping it up into pieces, called tokens, perhaps at the same time throwing away certain characters, such as punctuation. Here is an example of tokenization:

Input: Friends, Romans, Countrymen, lend me your ears;
Output: Friends | Romans | Countrymen | lend | me | your | ears

These tokens are often loosely referred to as terms or words, but it is sometimes important to make a type/token distinction. A token is an instance of a sequence of characters in some particular document that are grouped together as a useful semantic unit for processing. A type is the class of all tokens containing the same character sequence. A term is a (perhaps normalized) type that is included in the IR system's dictionary. The set of index terms could be entirely distinct from the tokens, for instance, they could be semantic identifiers in a taxonomy, but in practice in modern IR systems they are strongly related to the tokens in the document. However, rather than being exactly the tokens that appear in the document, they are usually derived from them by various normalization processes which are discussed in Section 2.2.3. For example, if the document to be indexed is to sleep perchance to dream, then there are 5 tokens, but only 4 types (since there are 2 instances of to). However, if to is omitted from the index (as a stop word, see Section 2.2.2), then there will be only 3 terms: sleep, perchance, and dream.
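The token/type/term counts in this example can be checked with a short sketch. The tokenizer and the one-word stop list below are illustrative assumptions, not a description of any particular IR system.

```python
import re

def tokenize(text):
    """Lowercase the text and keep maximal runs of letters or digits as tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

# Hypothetical stop list holding only the word omitted in the example.
STOP_WORDS = {"to"}

tokens = tokenize("to sleep perchance to dream")
types = sorted(set(tokens))                         # distinct character sequences
terms = [t for t in types if t not in STOP_WORDS]   # types kept in the dictionary

print(len(tokens), tokens)
print(len(types), types)
print(len(terms), terms)
```

Here every occurrence counts as a token, duplicates collapse into types, and the stop list decides which types survive as terms.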

The major question of the tokenization phase is what are the correct tokens to use? In this example, it looks fairly trivial: you chop on whitespace and throw away punctuation characters. This is a starting point, but even for English there are a number of tricky cases. For example, what do you do about the various uses of the apostrophe for possession and contractions?

Mr. O'Neill thinks that the boys' stories about Chile's capital aren't amusing.

For O'Neill, which of the following is the desired tokenization?

neill | oneill | o'neill | o' neill | o neill ?

And for aren't, is it:

aren't | arent | are n't | aren t ?

A simple strategy is to just split on all non-alphanumeric characters, but while o neill looks okay, aren t looks intuitively bad. For all of them, the choices determine which Boolean queries will match. A query of neill AND capital will match in three cases but not the other two. In how many cases would a query of o'neill AND capital match? If no preprocessing of a query is done, then it would match in only one of the five cases. For either Boolean or free text queries, you always want to do the exact same tokenization of document and query words, generally by processing queries with the same tokenizer. This guarantees that a sequence of characters in a text will always match the same sequence typed in a query.
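The effect of tokenizing documents and queries consistently can be illustrated with a small sketch. The two tokenizers and the matching helper are hypothetical, and the query is passed as a plain string of AND-ed terms rather than parsed from Boolean syntax.

```python
import re

def split_non_alnum(text):
    """Split on every non-alphanumeric character and lowercase the pieces."""
    return [t for t in re.split(r"[^A-Za-z0-9]+", text.lower()) if t]

def split_whitespace(text):
    """Split on whitespace and trim edge punctuation, so o'neill stays whole."""
    stripped = (w.strip(".,;:!?") for w in text.lower().split())
    return [w for w in stripped if w]

def matches_and_query(doc, query, doc_tokenizer, query_tokenizer):
    """Boolean AND: every query token must occur among the document tokens."""
    doc_tokens = set(doc_tokenizer(doc))
    return all(q in doc_tokens for q in query_tokenizer(query))

doc = "Mr. O'Neill thinks that the boys' stories about Chile's capital aren't amusing."

print(split_non_alnum("O'Neill aren't"))
# Same tokenizer on both sides: the query finds the document.
print(matches_and_query(doc, "o'neill capital", split_non_alnum, split_non_alnum))
# Mismatched tokenizers: the document holds o and neill, the query asks for o'neill.
print(matches_and_query(doc, "o'neill capital", split_non_alnum, split_whitespace))
```

Neither tokenizer is presented as the right one; the point is only that a mismatch between indexing time and query time silently loses matches.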

These issues of tokenization are language-specific. Tokenization thus requires the language of the document to be known. Language identification based on classifiers that use short character subsequences as features is highly effective; most languages have distinctive signature patterns (see Section 2.5 for references).
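As a rough illustration of using short character subsequences as features, the sketch below scores a text against per-language character trigram profiles. It is a toy overlap scorer standing in for a trained classifier; the sample sentences, the profile construction, and the scoring rule are all assumptions made for the example.

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Count character n-grams over the lowercased, space-normalized text."""
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

# Tiny hypothetical training samples; a real system would use far more text.
SAMPLES = {
    "english": "the quick brown fox jumps over the lazy dog and the cat sleeps in the house",
    "french": "le renard brun saute par-dessus le chien paresseux et le chat dort dans la maison",
    "german": "der schnelle braune fuchs springt über den faulen hund und die katze schläft im haus",
}
PROFILES = {lang: char_ngrams(sample) for lang, sample in SAMPLES.items()}

def identify(text):
    """Return the language whose trigram profile overlaps most with the text."""
    grams = char_ngrams(text)

    def overlap(profile):
        return sum(min(count, profile[g]) for g, count in grams.items())

    return max(PROFILES, key=lambda lang: overlap(PROFILES[lang]))

print(identify("the dog sleeps in the house"))
print(identify("le chat dort dans la maison"))
print(identify("der hund und die katze"))
```

With profiles this small the scorer only works on text close to the samples; it is meant to show the feature type, not a usable identifier.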

For most languages and particular domains within them there are unusual specific tokens that we wish to recognize as terms, such as the programming languages C++ and C#, aircraft names like B-52, or a T.V. show name such as M*A*S*H - which is sufficiently integrated into popular culture that you find usages such as M*A*S*H-style hospitals. Computer technology has introduced new types of character sequences that a tokenizer should probably tokenize as a single token, including email addresses ([email protected]), web URLs (http://stuff.big.com/new/specials.html), numeric IP addresses (142.32.48.231), package tracking numbers (1Z9999W99845399981), and more. One possible solution is to omit from indexing tokens such as monetary amounts, numbers, and URLs, since their presence greatly expands the size of the vocabulary. However, this comes at a large cost in restricting what people can search for. For instance, people might want to search in a bug database for the line number where an error occurs. Items such as the date of an email, which have a clear semantic type, are often indexed separately as document metadata.
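One way to keep such sequences together is a regular-expression tokenizer that tries the special patterns before falling back to ordinary words. The pattern below is a minimal sketch under that assumption: the individual alternatives are deliberately loose, and the email address in the sample text is a made-up placeholder.

```python
import re

# Alternatives are tried left to right, so special forms win over plain words.
TOKEN_PATTERN = re.compile(r"""
    https?://\S+[A-Za-z0-9/]                          # URLs, minus trailing punctuation
  | [A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}    # email addresses
  | \d{1,3}(?:\.\d{1,3}){3}                           # dotted numeric IP addresses
  | [A-Za-z](?:\*[A-Za-z])+                           # names written like M*A*S*H
  | [A-Za-z]\+\+                                      # C++
  | [A-Za-z]\#                                        # C sharp
  | [A-Za-z]+-\d+                                     # designations like B-52
  | [A-Za-z0-9]+(?:'[A-Za-z]+)?                       # ordinary words and numbers
""", re.VERBOSE)

def tokenize(text):
    """Return all tokens found by TOKEN_PATTERN, in order of appearance."""
    return TOKEN_PATTERN.findall(text)

sample = ("Email user@example.com about C++ and C# on the B-52, "
          "see http://stuff.big.com/new/specials.html, host 142.32.48.231, "
          "parcel 1Z9999W99845399981, M*A*S*H-style.")
print(tokenize(sample))
```

Each added alternative is a policy decision about what users can search for, which is why the list tends to grow with the domain of the collection.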

In English, hyphenation is used for various purposes ranging from splitting up vowels in words (co-education) to joining nouns as names (Hewlett-Packard) to a copyediting device to show word grouping (the hold-him-back-and-drag-him-away maneuver). It is easy to feel that the first example should be regarded as one token (and is indeed more commonly written as just coeducation), the last should be separated into words, and that the middle case is unclear. Handling hyphens automatically can thus be complex: it can either be done as a classification problem, or more commonly by some heuristic rules, such as allowing short hyphenated prefixes on words, but not longer hyphenated forms.
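The heuristic of allowing short hyphenated prefixes but not longer hyphenated forms might look like the sketch below. The three-character prefix limit is an arbitrary assumption chosen for illustration.

```python
def split_hyphens(token, max_prefix_len=3):
    """Keep prefix-plus-word forms whole; split every other hyphenated form."""
    parts = token.split("-")
    # Exactly one hyphen and a short first part: treat it as a prefixed word.
    if len(parts) == 2 and len(parts[0]) <= max_prefix_len:
        return [token]
    return [p for p in parts if p]

print(split_hyphens("co-education"))
print(split_hyphens("Hewlett-Packard"))
print(split_hyphens("hold-him-back-and-drag-him-away"))
```

Under this rule the unclear middle case, Hewlett-Packard, happens to be split; a different threshold or a list of known names would decide it the other way.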

Conceptually, splitting on white space can also split what should be regarded as a single token. This occurs most commonly with names (San Francisco, Los Angeles) but also with borrowed foreign phrases (au fait) and compounds that are sometimes written as a single word and sometimes space separated (such as white space vs. whitespace). Other cases with internal spaces that we might wish to regard as a single token include phone numbers ((800) 234-2333) and dates (Mar 11, 1983). Splitting tokens on spaces can cause bad retrieval results, for example, if a search for York University mainly returns documents containing New York University. The problems of hyphens and non-separating whitespace can even interact. Advertisements for air fares frequently contain items like San Francisco-Los Angeles, where simply doing whitespace splitting would give unfortunate results. In such cases, issues of tokenization interact with handling phrase queries (which we discuss in Section 2.4), particularly if we would like queries for all of lowercase, lower-case and lower case to return the same results. The last two can be handled by splitting on hyphens and using a phrase index. Getting the first case right would depend on knowing that it is sometimes written as two words and also indexing it in this way. One effective strategy in practice, which is used by some Boolean retrieval systems such as Westlaw and Lexis-Nexis, is to encourage users to enter hyphens wherever they may be possible, and whenever there is a hyphenated form, the system will generalize the query to cover all three of the one word, hyphenated, and two word forms, so that a query for over-eager will search for over-eager OR "over eager" OR overeager. However, this strategy depends on user training, since if you query using either of the other two forms, you get no generalization.
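The query generalization described for hyphenated forms can be sketched as a simple rewrite step. The function name and the output syntax (double quotes for a phrase, OR between alternatives) are assumptions for the example, not the query language of Westlaw or Lexis-Nexis.

```python
def expand_hyphenated(term):
    """Rewrite a hyphenated query term into hyphenated, phrase, and joined forms."""
    if "-" not in term:
        return term  # no hyphen typed, so no generalization is triggered
    parts = term.split("-")
    phrase = '"' + " ".join(parts) + '"'
    return " OR ".join([term, phrase, "".join(parts)])

print(expand_hyphenated("over-eager"))
print(expand_hyphenated("overeager"))
```

The second call shows the dependence on user training: a query typed without the hyphen passes through unchanged.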

Each new language presents some new issues. For instance, French has a variant use of the apostrophe for a reduced definite article 'the' before a word beginning with a vowel (e.g., l'ensemble) and has some uses of the hyphen with postposed clitic pronouns in imperatives and questions (e.g., donne-moi 'give me'). Getting the first case correct will affect the correct indexing of a fair percentage of nouns and adjectives: you would want documents mentioning both l'ensemble and un ensemble to be indexed under ensemble. Other languages make the problem harder in new ways. German writes compound nouns without spaces (e.g., Computerlinguistik 'computational linguistics'; Lebensversicherungsgesellschaftsangestellter 'life insurance company employee'). Retrieval systems for German greatly benefit from the use of a compound-splitter module, which is usually implemented by seeing if a word can be subdivided into multiple words that appear in a vocabulary. This phenomenon reaches its limit case with major East Asian languages (e.g., Chinese, Japanese, Korean, and Thai), where text is written without any spaces between words. An example is shown in Figure 2.3. One approach here is to perform word segmentation as prior linguistic processing. Methods of word segmentation vary from having a large vocabulary and taking the longest vocabulary match with some heuristics for unknown words to the use of machine learning sequence models, such as hidden Markov models or conditional random fields, trained over hand-segmented words (see the references in Section 2.5). Since there are multiple possible segmentations of character sequences (see Figure 2.4), all such methods make mistakes sometimes, and so you are never guaranteed a consistent unique tokenization. The other approach is to abandon word-based indexing and to do all indexing via just short subsequences of characters (character k-grams), regardless of whether particular sequences cross word boundaries or not. Three reasons why this approach is appealing are that an individual Chinese character is more like a syllable than a letter and usually has some semantic content, that most words are short (the commonest length is 2 characters), and that, given the lack of standardization of word breaking in the writing system, it is not always clear where word boundaries should be placed anyway. Even in English, some cases of where to put word boundaries are just orthographic conventions - think of notwithstanding vs. not to mention or into vs. on to - but people are educated to write the words with consistent use of spaces.
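Both approaches can be sketched briefly. The first function is a greedy longest-vocabulary-match segmenter with a single-character fallback as its only heuristic for unknown words; the second produces overlapping character k-grams. The tiny vocabularies are assumptions for the example, and the same vocabulary lookup is shown splitting a German compound.

```python
def max_match(text, vocabulary):
    """Greedy segmentation: repeatedly take the longest vocabulary entry that fits."""
    max_len = max(map(len, vocabulary), default=1)
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # Fall back to a single character when nothing in the vocabulary fits.
            if size == 1 or candidate in vocabulary:
                tokens.append(candidate)
                i += size
                break
    return tokens

def char_kgrams(text, k=2):
    """Overlapping character k-grams, ignoring any notion of word boundaries."""
    return [text[i:i + k] for i in range(len(text) - k + 1)]

print(max_match("computerlinguistik", {"computer", "linguistik"}))
print(max_match("中国人民", {"中国", "人民"}))
print(char_kgrams("中国人民", k=2))
```

The k-gram output includes a bigram that straddles the two words, which is exactly the behavior the character-based approach accepts in exchange for not needing a segmenter.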

Figure 2.3: The standard unsegmented form of Chinese text using the simplified characters of mainland China. There is no whitespace between words, not even between sentences - the apparent space after the Chinese period (。) is just a typographical illusion caused by placing the character on the left side of its square box. The first sentence is just words in Chinese characters with no spaces between them. The second and third sentences include Arabic numerals and punctuation breaking up the Chinese characters.

Figure 2.4: Ambiguities in Chinese word segmentation. The two characters 和尚 can be treated as one word meaning 'monk' or as a sequence of two words meaning 'and' and 'still'.
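The ambiguity can be made concrete by enumerating every way to cover a string with vocabulary entries. This is a sketch only: the three-entry vocabulary is an assumption, using 和尚 ('monk'), 和 ('and'), and 尚 ('still') as the two-character sequence and its parts.

```python
def all_segmentations(text, vocabulary):
    """Return every split of text into a sequence of vocabulary entries."""
    if not text:
        return [[]]  # one way to segment the empty string: no tokens
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in vocabulary:
            for rest in all_segmentations(text[end:], vocabulary):
                results.append([head] + rest)
    return results

vocabulary = {"和尚", "和", "尚"}
print(all_segmentations("和尚", vocabulary))
```

A vocabulary alone cannot choose between the two readings; that choice needs context, which is where the sequence models mentioned earlier come in.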

© 2008 Cambridge University Press
2009-04-07

FAQs

Tokenization? ›

Tokenization is the process of exchanging sensitive data for nonsensitive data called “tokens” that can be used in a database or internal system without bringing it into scope.

What is tokenization for dummies? ›

In general, tokenization is the process of issuing a digital, unique, and anonymous representation of a real thing. In Web3 applications, the token is used on a (typically private) blockchain, which allows the token to be used within specific protocols.

What is a simple example of tokenization? ›

For example, tokenizing the sentence “I love ice cream” would result in three tokens: “I,” “love,” and “ice cream.” It's a fundamental step in natural language processing and text analysis tasks.

What is tokenization in banking and finance? ›

A blockchain-based technology, tokenization, can help deliver value in these areas, right now. Tokenization lets you digitally represent asset ownership for any tangible or intangible asset — stocks or bonds, cash or cryptocurrency, data sets or loyalty points — on a blockchain.

What is the difference between tokenization and blockchain? ›

Tokenization refers to the registration of asset ownership on blockchain infrastructure. In tokenized form, assets can potentially benefit from a blockchain's functionality, including more efficient settlement and the ability to interact with smart contracts.

What is tokenization in simple words? ›

Tokenization refers to a process by which a piece of sensitive data, such as a credit card number, is replaced by a surrogate value known as a token. The sensitive data still generally needs to be stored securely at one centralized location for subsequent reference and requires strong protections around it.

How do you do tokenization? ›

In Python, tokenization in NLP can be accomplished using various libraries such as NLTK, SpaCy, or the tokenization module in the Transformers library. These libraries offer functions to split text into tokens, such as words or subwords, based on different rules and language-specific considerations.

What is an example of payment tokenization? ›

Payment Tokenization Example

When a merchant processes the credit card of a customer, the PAN is substituted with a token. 1234-4321-8765-5678 is replaced with, for example, 6f7%gf38hfUa. The merchant can apply the token ID to retain records of the customer, for example, 6f7%gf38hfUa is connected to John Smith.

How does tokenization work in payments? ›

Payment tokenization is the process by which sensitive personal information is replaced with a surrogate value — a token. That replaced value is stored in a PCI-compliant token vault owned by the token creator, which can be an entity such as an acquirer, issuer, network or payment processor.

Why do we do tokenization? ›

Tokenization breaks text into smaller parts for easier machine analysis, helping machines understand human language. Tokenization, in the realm of Natural Language Processing (NLP) and machine learning, refers to the process of converting a sequence of text into smaller parts, known as tokens.

How safe is tokenization? ›

Tokenization and PCI DSS

With tokenization, only the tokenized data is within the PCI DSS audit scope, while the actual sensitive information is securely stored off-site. For those in fintech, insurance, and healthcare, tokenization provides a secure framework for transactions that supports compliance with PCI DSS requirements.

How to tokenize an asset? ›

Select the Asset to Tokenize. The first step is to identify the asset that you want to tokenize. ...
Define Token Type. ...
Choose the Blockchain You Want to Issue Your Tokens On. ...
Select a Third-Party Auditor To Verify Off-Chain Assets. ...
Use Chainlink Proof of Reserve To Help Secure the Minting of the Tokens.

Is tokenization the future? ›

Future Outlook

As blockchain and decentralized technologies evolve, tokenization will likely become more mainstream, impacting various industries beyond finance. In conclusion, tokenization is not merely a trend but a transformative force shaping the future of finance.

What is the difference between digitization and tokenization? ›

The key difference between the two is in scope. Digitization involves fully digitizing an asset or card, while tokenization involves creating a token that stands in for a piece of sensitive credit card data.

What is the goal of tokenization? ›

The primary goal of tokenization is to represent text in a manner that's meaningful for machines without losing its context. By converting text into tokens, algorithms can more easily identify patterns.

What is tokenization in real world? ›

Real-world asset tokenization is the digitization of physical or traditional assets represented on decentralized infrastructure such as blockchain or similar forms of distributed ledger technology (DLT).

What is the difference between encryption and tokenization? ›

Tokenization focuses on replacing data with unrelated tokens, minimizing the exposure of sensitive information, and simplifying compliance with data protection regulations. Encryption, on the other hand, secures data by converting it into an unreadable format, necessitating a decryption key for access.