Applied Data Science

Overview
Data science combines data processing, machine learning, data mining, statistics, and predictive analytics. A key element is machine learning, born from the marriage of pattern recognition and computational learning theory, which enables software to recognize and interpret new rules or behaviors without being explicitly programmed to do so, and to react accordingly to patterns in human activity.

Data Science at NTENT falls into three major categories: Natural Language Processing (NLP), Data Analytics and Data Acquisition. The key functionality of the NLP engine is to semantically interpret (i.e. understand and represent the meaning of) natural language inputs of arbitrary complexity and type. These interpretations are then used to gauge user search intent and market-specific relevance, and to provide appropriate results on that basis. Using a multi-phase approach informed by hybrid NLP methods, NTENT’s patented technology sifts through text and usage data collected around the world. The key functionality of Data Analytics is tracking and modeling system and product behavior by leveraging usage/user data and performing market analysis.

Data Analytics and Acquisition
The first area where Data Science is applied is Data Acquisition. There are three main sources of data we acquire: content used in response to user requests, knowledge used in reasoning and language understanding, and usage data. The data we acquire is used to train various parts of the NTENT Search Platform. For example, content provides the basis for language models that help in understanding word and concept popularity. Knowledge augments the platform’s ability to understand the meaning of words. Finally, observing which documents are selected in response to a query provides important clues about what people mean. For example, a search for “meatloaf recipe” refers to the meatloaf dish (not the rock band), and a good result may not include the word “recipe” at all but merely contain ingredients and instructions for preparing the dish.
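To make this concrete, here is a minimal sketch of how click-through logs might be aggregated into a per-query distribution over candidate senses; the record format and topic labels are hypothetical, not NTENT’s actual schema:

```python
from collections import Counter, defaultdict

# Hypothetical click-log records: (query, topic of the clicked document).
# In practice these come from large-scale usage logs.
click_log = [
    ("meatloaf recipe", "dish"),
    ("meatloaf recipe", "dish"),
    ("meatloaf recipe", "band"),
    ("meatloaf tour dates", "band"),
]

# Count, per query, how often users selected documents of each topic.
sense_counts = defaultdict(Counter)
for query, topic in click_log:
    sense_counts[query][topic] += 1

def sense_distribution(query):
    """Turn raw click counts into a probability over candidate senses."""
    counts = sense_counts[query]
    total = sum(counts.values())
    return {topic: n / total for topic, n in counts.items()}

print(sense_distribution("meatloaf recipe"))  # {'dish': 0.667, 'band': 0.333} (approx.)
```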

Natural Language Processing (NLP)
NTENT’s NLP Engine is a semantic engine, i.e. it goes beyond the standard range of language processing tasks (tokenization, part-of-speech tagging, parsing, etc.) and generates semantic representations: formal, computable and language-independent statements derived from natural language input of arbitrary complexity and type (e.g. queries, documents, chat/forum posts, etc.). Semantic representations serve as a lingua franca for the various search and question answering applications that interact with the NLP engine.

Pre-processing steps include document format conversion, character set detection and conversion, HTML (or XML) DOM building, boilerplate detection and extraction of text from the DOM. Additional pre-semantic processing steps include tokenization, POS-tagging, lemmatization and keyword ranking.
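As an illustration of these pre-processing steps, the following sketch uses the open-source chardet and BeautifulSoup libraries as stand-ins for NTENT’s own components; the crude tag-based boilerplate removal is a simplification:

```python
import chardet                      # character-set detection
from bs4 import BeautifulSoup       # HTML DOM building / text extraction

def extract_text(raw_bytes: bytes) -> str:
    # Detect the character set and decode the raw document bytes.
    guess = chardet.detect(raw_bytes)
    html = raw_bytes.decode(guess["encoding"] or "utf-8", errors="replace")

    # Build a DOM and drop obvious boilerplate containers.
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "nav", "footer"]):
        tag.decompose()

    # Extract the visible text from the remaining DOM.
    return soup.get_text(separator=" ", strip=True)

print(extract_text(b"<html><body><nav>menu</nav><p>Hello world</p></body></html>"))
# Hello world
```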

Tokenization
involves breaking the text into words on whitespace and handling punctuation such as apostrophes and periods (a non-trivial issue). Things get more complicated in ideographic languages like Chinese. After this point, the indexer is concerned solely with the token stream. The standard tokenization steps include case-folding and character normalization, including diacritics. Tokenization algorithms are sensitive to language-specific behavior, enabling the system, for example, to remove the optional tilde in English but preserve it in Spanish.
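A minimal sketch of such language-sensitive tokenization, using Unicode decomposition to strip diacritics for English while preserving them for Spanish (the two-language switch is an illustrative simplification):

```python
import re
import unicodedata

def tokenize(text: str, lang: str = "en") -> list[str]:
    # Case-fold first (handles more cases than lower() for some scripts).
    text = text.casefold()

    # Language-sensitive character normalization: strip diacritics in
    # English, but preserve them in Spanish where they are contrastive.
    if lang == "en":
        decomposed = unicodedata.normalize("NFD", text)
        text = "".join(c for c in decomposed if not unicodedata.combining(c))

    # Split on whitespace and strip surrounding punctuation; keeping
    # word-internal apostrophes is exactly the kind of non-trivial
    # decision real tokenizers must make.
    return re.findall(r"[\w']+", text)

print(tokenize("Jalapeño salsa", lang="en"))  # ['jalapeno', 'salsa']
print(tokenize("Jalapeño salsa", lang="es"))  # ['jalapeño', 'salsa']
```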

Part-of-speech tagging is the standard procedure of assigning a grammatical category to each token in the input stream. It is a language-specific procedure, and appropriate algorithms are engaged for each language.
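Production taggers are statistical and trained per language; the toy dictionary-plus-suffix tagger below merely illustrates the task of assigning a category to each token (the lexicon and rules are invented for the example):

```python
# A toy lexicon and crude suffix heuristics; real taggers are
# statistical models trained on annotated corpora.
LEXICON = {"the": "DET", "a": "DET", "grades": "NOUN", "improved": "VERB"}
SUFFIX_RULES = [("ly", "ADV"), ("ing", "VERB"), ("s", "NOUN")]

def pos_tag(tokens):
    tagged = []
    for tok in tokens:
        tag = LEXICON.get(tok.lower())
        if tag is None:
            # Fall back to suffix heuristics for unknown words.
            tag = next((t for suf, t in SUFFIX_RULES if tok.endswith(suf)), "NOUN")
        tagged.append((tok, tag))
    return tagged

print(pos_tag(["The", "students", "improved", "quickly"]))
# [('The', 'DET'), ('students', 'NOUN'), ('improved', 'VERB'), ('quickly', 'ADV')]
```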

Lemmatization
deserves special attention here. The process involves “normalizing” expressions as they appear in input text to the standard, human-readable forms in which they are normally attested in the underlying knowledge resource. For example, the inflected noun “students’” in “students’ grades,” when lemmatized, becomes “student”: its genitive case is folded into the nominative, and its plural form into the singular. Lemmatization is a key prerequisite for matching, i.e. linking surface tokens to ontological concepts.

Lemmatization of morphologically rich languages like Russian or Finnish is a complex issue because of the increased homonymy between grammatical forms. The Russian word “три” (/tri/), for example, could be the numeral “three” or the verb “rub” in the imperative form. Lemmatizing this word properly often requires considering additional morphological and syntactic properties of the context. NTENT’s proprietary resources include extensive grammatical dictionaries, i.e. inflection tables, for multiple languages, which enable our lemmatizer to accurately and efficiently identify lemmata, resorting to machine learning where needed.
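A toy sketch of dictionary-based lemmatization with context-driven disambiguation, using an invented two-entry inflection table; real grammatical dictionaries cover vastly more forms:

```python
# A toy slice of a grammatical dictionary: surface form -> analyses.
INFLECTIONS = {
    "students'": [("student", "NOUN")],
    "три": [("три", "NUM"),        # the numeral "three"
            ("тереть", "VERB")],   # imperative of "to rub"
}

def lemmatize(token, context_pos=None):
    analyses = INFLECTIONS.get(token, [(token, None)])
    if len(analyses) == 1:
        return analyses[0][0]
    # Homonymous forms need context: prefer the analysis whose part
    # of speech matches what the surrounding syntax expects.
    for lemma, pos in analyses:
        if pos == context_pos:
            return lemma
    return analyses[0][0]

print(lemmatize("students'"))                # student
print(lemmatize("три", context_pos="VERB"))  # тереть
```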

Keyword ranking involves recording various forms of the tokens as keywords in the document and assigning ranks algorithmically. This step considers keywords to be more important when they appear more often in the document or in the title, and less important when they appear often in the corpus.
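The description above corresponds closely to a TF-IDF-style weighting with a title boost; the sketch below illustrates that family of scoring functions (the actual ranking algorithm and its parameters are proprietary):

```python
import math
from collections import Counter

def rank_keywords(doc_tokens, title_tokens, corpus_doc_freq, n_docs, title_boost=2.0):
    """Score keywords: frequent in the document (and the title) is good;
    frequent across the corpus is penalized (a TF-IDF-style weighting)."""
    tf = Counter(doc_tokens)
    scores = {}
    for term, freq in tf.items():
        idf = math.log(n_docs / (1 + corpus_doc_freq.get(term, 0)))
        boost = title_boost if term in title_tokens else 1.0
        scores[term] = freq * idf * boost
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

doc = "meatloaf recipe ground beef meatloaf oven".split()
title = {"meatloaf", "recipe"}
corpus_df = {"meatloaf": 40, "recipe": 800, "ground": 300, "beef": 450, "oven": 1200}
print(rank_keywords(doc, title, corpus_df, n_docs=10000)[:3])
# 'meatloaf' ranks first: frequent in the document, in the title, rare in the corpus.
```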

Semantic processing
involves a range of procedures for extracting underlying ontological concepts from the tokenized surface input stream. The resulting extractions may be standalone concepts or clusters of concepts related via ontological links. The semantic processing procedures unfold as part of a pipeline that aggregates multiple, concurrently engaged processing and scoring components. Among the key semantic processing procedures are dynamic entity extraction, matching, scoring and deriving semantic representations, the ultimate output of the NLP engine.

Dynamic entity extraction
involves identifying sub-semantic structured entities like dates, phone numbers or zip codes. Named entities per se, i.e. those interpreted as instances of existing ontological classes, are extracted by a separate component called the Named Entity Recognizer.
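Detectors for such sub-semantic entities are often pattern-based; here is a minimal regex sketch with hypothetical, deliberately simplistic patterns for dates, phone numbers and zip codes:

```python
import re

# Hypothetical detectors for sub-semantic structured entities; real
# detectors handle many more formats and locales.
DETECTORS = {
    "date":  re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "zip":   re.compile(r"\b\d{5}(?:-\d{4})?\b"),
}

def extract_entities(text):
    hits = []
    for label, pattern in DETECTORS.items():
        for m in pattern.finditer(text):
            hits.append((label, m.group(), m.span()))
    return hits

print(extract_entities("Call 760-555-0123 before 2024-07-01, zip 92008."))
# [('date', '2024-07-01', ...), ('phone', '760-555-0123', ...), ('zip', '92008', ...)]
```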

The NLP Engine processes natural language input by performing concept matching and scoring. Concept matching comprises a series of steps, from identifying simple, standalone concepts in the input stream to constructing compositional representations, i.e. sets of concepts linked through relations. The capabilities underlying concept matching and scoring are built fully in-house, draw heavily on the company’s proprietary knowledge resources, and support a wide range of tasks from named entity recognition to fact extraction and compositional semantics. The latter two are very hard to attain and are generally considered the “holy grail” of semantic natural language processing.
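As a toy illustration of the idea (not NTENT’s proprietary matcher), the sketch below links lemmas to candidate ontological concepts and keeps the pairs that an ontological relation licenses as a compositional reading:

```python
# A toy ontology fragment: lemma -> candidate concept IDs.
ONTOLOGY = {
    "meatloaf": ["dish/meatloaf", "band/meat_loaf"],
    "recipe":   ["document_type/recipe"],
}

# Relations licensing compositional readings between concept pairs.
RELATIONS = {("dish/meatloaf", "document_type/recipe"): "has_preparation_instructions"}

def match_concepts(lemmas):
    # Step 1: standalone concept candidates per lemma.
    candidates = [ONTOLOGY.get(lemma, []) for lemma in lemmas]
    # Step 2: keep pairs linkable through an ontological relation.
    compositions = []
    for left in candidates:
        for right in candidates:
            for a in left:
                for b in right:
                    rel = RELATIONS.get((a, b))
                    if rel:
                        compositions.append((a, rel, b))
    return compositions

print(match_concepts(["meatloaf", "recipe"]))
# [('dish/meatloaf', 'has_preparation_instructions', 'document_type/recipe')]
```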

As part of the processing, the NLP engine runs a number of special-purpose detectors and scorers that produce signals such as graph-based proximity, geographic proximity, cultural notability biases, default and ontological type reasoning, and language-specific expression statistics. Performing these tasks allows the engine to compute confidence scores for its output candidates, a critical precondition for many multi-modular systems.

It is hard to overestimate the importance of named entity recognition. One of the most challenging issues an open-world system like NTENT’s faces is that just about any known word in any language might acquire a new sense not previously attested in any existing resource. Consider, for example, a scenario where one or more web documents appear that discuss a new, very recently formed but instantly popular Guamanian music band named Trout. Understanding queries like “trout album”, “trout lineup” or “trout concert” requires the ability to distinguish the novel sense of “trout” from established ones. The Named Entity Recognizer is built precisely to provide NTENT’s semantic technology with this ability, determining that the sense of the word “trout” in the above queries is none of the previously defined ones and is likely the name of a group. The Named Entity Recognizer thus acts as a “safety net” that bridges the lexical gap between the well-established vocabulary and the fluid, never-ending emergence of new senses.
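A heavily simplified sketch of the underlying idea: when the context cues a class that none of a word’s established senses supports, hypothesize a new named entity (the cue list and sense inventory are invented for the example):

```python
# Known senses from the ontology; the context cues are hypothetical.
KNOWN_SENSES = {"trout": ["animal/fish"], "album": ["music/album"]}
MUSIC_CUES = {"album", "lineup", "concert", "tour"}

def detect_novel_entity(query_tokens):
    hypotheses = []
    for tok in query_tokens:
        senses = KNOWN_SENSES.get(tok, [])
        context = set(query_tokens) - {tok}
        # If the context strongly cues a music-group reading that none
        # of the established senses supports, flag a candidate entity.
        if context & MUSIC_CUES and not any(s.startswith("music/") for s in senses):
            hypotheses.append((tok, "candidate: music_group"))
    return hypotheses

print(detect_novel_entity(["trout", "album"]))
# [('trout', 'candidate: music_group')]
```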

The ultimate decision-making component of the NLP engine is a machine learning module that relies on the signals produced by the various detectors and scorers. It determines the probability of every possible interpretation set and thereby settles on one or more plausible interpretations of the input. The module is trained on semantic annotations of representative corpora gathered in-house.
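As a schematic illustration, the decision could be modeled as a logistic combination of scorer signals; the signal names and weights below are invented, and the real module is trained on in-house semantic annotations:

```python
import math

# Hypothetical signals produced by the detectors and scorers for one
# candidate interpretation (the names are illustrative, not NTENT's).
signals = {
    "graph_proximity": 0.8,
    "geo_proximity": 0.1,
    "notability_bias": 0.6,
    "expression_stats": 0.7,
}

# Weights would come from training on annotated corpora; these are
# made up for illustration.
weights = {
    "graph_proximity": 2.1,
    "geo_proximity": 0.4,
    "notability_bias": 1.3,
    "expression_stats": 1.7,
}
bias = -2.0

def interpretation_probability(signals, weights, bias):
    # A logistic model: weighted sum of signals squashed into [0, 1].
    z = bias + sum(weights[k] * v for k, v in signals.items())
    return 1.0 / (1.0 + math.exp(-z))

print(f"{interpretation_probability(signals, weights, bias):.2f}")
```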

Conclusion
Applied Data Science at NTENT represents a broad range of projects (including R&D work) carried out by multidisciplinary teams of machine learning experts, core programmers, language engineers and ontologists. In many ways, NTENT is uniquely positioned to tackle complex data science problems because its technology incorporates, in a balanced fashion, both in-house components and state-of-the-art external data/text analytics components. Many of NTENT’s in-house components are the product of years of careful engineering, testing and successful deployment, and we have reached a point of maturity that makes it possible for the company to experiment with, adopt and advance new technologies without disrupting its core functionality.