TEXT MINING & SEMANTICS

A brief introduction to how we use syntactic and semantic analysis to extract relevant, useful information from large bodies of unstructured data, whether the text is written in German, English, French or Italian.

STEP 01:  CONTENT PROCESSING  |  EXAMPLE

Example

Given the text snippet above, we will now illustrate the process of text mining and semantic analysis.

STEP 02:  CONTENT PROCESSING  |  TOKENIZATION

Tokenization

The text is first segmented into linguistic units such as words, numbers or punctuation.
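
As a rough illustration of this step, the following sketch segments a sentence with a single regular expression. The pattern and the function name `tokenize` are assumptions made for this example; a production tokenizer would also need language-specific rules for abbreviations, clitics and the like.

```python
import re

# Hypothetical token pattern, tried left to right:
#   1. numbers, optionally with grouping/decimal separators (1,250.50)
#   2. words, optionally joined by hyphens or apostrophes (Zürich-based)
#   3. any single remaining non-space symbol (punctuation)
TOKEN_RE = re.compile(r"\d+(?:[.,]\d+)*|\w+(?:[-']\w+)*|[^\w\s]")

def tokenize(text: str) -> list[str]:
    """Split text into word, number and punctuation tokens."""
    return TOKEN_RE.findall(text)

print(tokenize("Zürich-based ACME paid 1,250.50 francs in 2015."))
# ['Zürich-based', 'ACME', 'paid', '1,250.50', 'francs', 'in', '2015', '.']
```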

STEP 03:  CONTENT PROCESSING  |  PART-OF-SPEECH TAGGING

Part-of-Speech Tagging

We assign a word class (part-of-speech) to all words and other kinds of tokens in the text. Parts of speech include nouns (N), verbs (V), adjectives (ADJ), adverbs (ADV), conjunctions (CNJ), and so on.
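
To make the input and output of this step concrete, here is a deliberately tiny lookup-based tagger. It is only a stand-in: the lexicon, the fallback rules and the extra tags `DET`, `NUM` and `PUNCT` are assumptions for this example, and real taggers are typically statistical models trained on annotated corpora.

```python
# Toy lexicon (an assumption for illustration only).
LEXICON = {
    "the": "DET", "a": "DET",
    "and": "CNJ", "or": "CNJ", "but": "CNJ",
    "quickly": "ADV", "very": "ADV",
    "large": "ADJ", "new": "ADJ",
    "opens": "V", "opened": "V", "is": "V",
}

def tag_tokens(tokens):
    """Return (token, tag) pairs using lexicon lookup plus crude fallbacks."""
    tagged = []
    for tok in tokens:
        low = tok.lower()
        if low in LEXICON:
            pos = LEXICON[low]
        elif tok[:1].isdigit():
            pos = "NUM"
        elif not any(ch.isalnum() for ch in tok):
            pos = "PUNCT"
        elif low.endswith("ly"):
            pos = "ADV"      # crude suffix heuristic
        else:
            pos = "N"        # default guess for unknown words
        tagged.append((tok, pos))
    return tagged

print(tag_tokens(["The", "company", "quickly", "opened", "a", "new", "office", "."]))
```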

STEP 04:  CONTENT PROCESSING  |  LEMMATIZATION

Lemmatization

Based on the previously assigned part-of-speech tags, we determine the lemma (base form) of each word. This lets us abstract away from inflected forms and resolve ambiguities between identical word forms.
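
The role of the part-of-speech tag can be sketched as follows: the same surface form maps to different lemmas depending on its tag. The exception table and the suffix rules are assumptions for this example and are far cruder than a real morphological lexicon.

```python
# Hypothetical exception table keyed by (word form, part-of-speech tag).
IRREGULAR = {
    ("saw", "V"): "see",
    ("was", "V"): "be",
    ("better", "ADJ"): "good",
    ("mice", "N"): "mouse",
}

def lemmatize(token: str, pos: str) -> str:
    """Return a base form for token, given its part-of-speech tag."""
    form = token.lower()
    if (form, pos) in IRREGULAR:
        return IRREGULAR[(form, pos)]
    if pos == "V":
        # Strip common verbal suffixes (crude; "making" would become "mak").
        for suffix in ("ing", "ed"):
            if form.endswith(suffix) and len(form) - len(suffix) >= 3:
                return form[: -len(suffix)]
    if pos in ("N", "V") and form.endswith("s") and not form.endswith("ss") and len(form) > 3:
        return form[:-1]
    return form

print(lemmatize("saw", "V"), lemmatize("saw", "N"))          # see saw
print(lemmatize("meeting", "V"), lemmatize("meeting", "N"))  # meet meeting
```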

STEP 05:  CONTENT PROCESSING  |  PATTERN MATCHING

Pattern Matching

Using regular expressions, we are able to recognize units such as email addresses, URLs, date and time specifications, numbers or monetary expressions.
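
A minimal sketch of this kind of pattern matching is shown below. The patterns are simplified assumptions for illustration (for example, only three currency codes and two date formats are covered) and are not the expressions used in the actual system.

```python
import re

# Simplified, hypothetical patterns; order matters because earlier
# alternatives win when several could match at the same position.
PATTERNS = {
    "EMAIL": r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+",
    "URL": r'https?://[^\s<>"]+',
    "DATE": r"\b\d{1,2}\.\d{1,2}\.\d{4}\b|\b\d{4}-\d{2}-\d{2}\b",
    "TIME": r"\b\d{1,2}:\d{2}\b",
    "MONEY": r"(?:CHF|EUR|USD)\s?\d+(?:[.,']\d+)*",
}

# One combined expression with a named group per unit type.
COMBINED = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in PATTERNS.items())
)

def find_units(text: str) -> list[tuple[str, str]]:
    """Return (unit type, matched text) pairs in order of appearance."""
    return [(m.lastgroup, m.group()) for m in COMBINED.finditer(text)]

text = ("Contact info@example.com or visit https://example.com/docs "
        "before 2015-03-01 at 14:30; the fee is CHF 1'250.50.")
for unit in find_units(text):
    print(unit)
```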

STEP 06:  CONTENT PROCESSING  |  MULTI-WORD EXPRESSION IDENTIFICATION

Multi-Word Expression Identification

We recognize the Multi-Word Expressions (MWEs) stored in our Knowledge Base, and we also employ machine learning methods to discover significant MWEs dynamically in document sets.
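
One simple way to surface candidate MWEs from a document set is to score adjacent word pairs by pointwise mutual information (PMI). The sketch below uses PMI as a stand-in for the unspecified machine learning methods mentioned above; the function name and the thresholds `min_count` and `min_pmi` are assumptions for this example.

```python
import math
from collections import Counter

def significant_bigrams(documents, min_count=2, min_pmi=1.0):
    """Score adjacent token pairs by PMI over a list of tokenized documents."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in documents:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    total_uni = sum(unigrams.values())
    total_bi = sum(bigrams.values())

    scored = []
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # too rare to judge reliably
        p_pair = count / total_bi
        p_independent = (unigrams[w1] / total_uni) * (unigrams[w2] / total_uni)
        pmi = math.log2(p_pair / p_independent)
        if pmi >= min_pmi:
            scored.append(((w1, w2), pmi))
    # Highest-scoring pairs first.
    return sorted(scored, key=lambda item: item[1], reverse=True)

docs = [
    "the central bank raised rates".split(),
    "the central bank cut rates".split(),
    "the river bank flooded".split(),
]
for pair, score in significant_bigrams(docs):
    print(pair, round(score, 2))
```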

STEP 07:  CONTENT PROCESSING  |  KEYWORD ANALYSIS

Keyword Analysis

Relevant keywords – including multi-word expressions – are identified using statistical analysis based on the whole text corpus (document collection).
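
A common corpus-based statistic for this purpose is TF-IDF, which favors terms that are frequent in one document but rare across the collection. The sketch below assumes TF-IDF purely for illustration, since the text does not name the statistic actually used; multi-word expressions could be handled by treating each one as a single token.

```python
import math
from collections import Counter

def tfidf_keywords(corpus, doc_index, top_n=3):
    """Rank the terms of one tokenized document by TF-IDF within the corpus."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for tokens in corpus:
        doc_freq.update(set(tokens))  # count each term once per document

    document = corpus[doc_index]
    term_freq = Counter(document)
    scores = {
        term: (count / len(document)) * math.log(n_docs / doc_freq[term])
        for term, count in term_freq.items()
    }
    # Sort by descending score, breaking ties alphabetically.
    return sorted(scores, key=lambda term: (-scores[term], term))[:top_n]

corpus = [
    ["merger", "talks", "merger", "the", "bank"],
    ["the", "bank", "opens"],
    ["the", "match", "ends"],
]
print(tfidf_keywords(corpus, 0, top_n=2))  # ['merger', 'talks']
```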

STEP 08:  CONTENT PROCESSING  |  NAMED ENTITY RECOGNITION

Named Entity Recognition

Using machine learning techniques, we locate and classify named entities in text into classes such as persons, locations, organizations and companies.
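
Sequence-labeling models for this task commonly emit one BIO label per token (`B-` begins an entity, `I-` continues it, `O` is outside any entity). The sketch below does not train a model; it only shows how such labels can be decoded into entity spans, with hard-coded labels standing in for model output. The label set `PER`, `ORG`, `LOC` is an assumption for this example.

```python
def decode_bio(tokens, labels):
    """Turn per-token BIO labels into (entity text, entity type) pairs."""
    entities = []
    current_tokens, current_type = [], None

    def flush():
        if current_tokens:
            entities.append((" ".join(current_tokens), current_type))

    for token, label in zip(tokens, labels):
        starts_new = label.startswith("B-") or (
            label.startswith("I-") and label[2:] != current_type
        )
        if starts_new:
            flush()
            current_tokens, current_type = [token], label[2:]
        elif label.startswith("I-"):
            current_tokens.append(token)
        else:  # "O": close any open entity
            flush()
            current_tokens, current_type = [], None
    flush()
    return entities

tokens = ["Anna", "Muster", "joined", "Example", "AG", "in", "Zurich", "."]
labels = ["B-PER", "I-PER", "O", "B-ORG", "I-ORG", "O", "B-LOC", "O"]  # stand-in for model output
print(decode_bio(tokens, labels))
# [('Anna Muster', 'PER'), ('Example AG', 'ORG'), ('Zurich', 'LOC')]
```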

STEP 09:  CONTENT ENRICHMENT  |  CANDIDATE IDENTIFICATION

Candidate Identification

Our Knowledge Base provides more than 4.3 million locations, 3.3 million persons, and 520,000 companies and organizations, which serve as disambiguation candidates for the previously identified named entity mentions.
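
Candidate lookup can be pictured as an index from surface forms to Knowledge Base entries. The three entries, their identifiers and the field names below are invented for this example and do not reflect the real Knowledge Base schema.

```python
from collections import defaultdict

# Hypothetical miniature knowledge base.
KNOWLEDGE_BASE = [
    {"id": "loc:paris-fr", "name": "Paris", "aliases": ["Paris, France"]},
    {"id": "loc:paris-tx", "name": "Paris", "aliases": ["Paris, Texas"]},
    {"id": "per:paris-hilton", "name": "Paris Hilton", "aliases": ["Paris"]},
]

def build_alias_index(entries):
    """Map each case-folded name or alias to the ids of matching entries."""
    index = defaultdict(list)
    for entry in entries:
        for surface in {entry["name"], *entry["aliases"]}:
            index[surface.casefold()].append(entry["id"])
    return index

def candidates(mention, index):
    """Return the sorted candidate ids for an entity mention."""
    return sorted(index.get(mention.casefold(), []))

index = build_alias_index(KNOWLEDGE_BASE)
print(candidates("Paris", index))
# ['loc:paris-fr', 'loc:paris-tx', 'per:paris-hilton']
```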

STEP 10:  CONTENT ENRICHMENT  |  DISAMBIGUATION & KNOWN RELATIONS

Disambiguation & Known Relations

Ambiguities are resolved with high precision using ontological relations and context information from our Knowledge Base. The remaining entity mentions are clustered as unknown entities.
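
A very reduced form of context-based disambiguation scores each candidate by how many of its related terms occur near the mention, and leaves the mention unresolved when no candidate has enough support. The related-term sets, identifiers and the `min_overlap` threshold are assumptions for this example; the real system draws on ontological relations that are not modeled here.

```python
# Hypothetical related terms per candidate, standing in for knowledge base context.
CANDIDATE_CONTEXT = {
    "loc:paris-fr": {"france", "seine", "louvre", "eiffel"},
    "loc:paris-tx": {"texas", "lamar", "county"},
}

def disambiguate(candidate_ids, context_tokens, min_overlap=1):
    """Pick the candidate sharing the most terms with the context, or None."""
    context = {token.casefold() for token in context_tokens}
    best_id, best_score = None, 0
    for candidate_id in sorted(candidate_ids):  # sorted for deterministic ties
        related = CANDIDATE_CONTEXT.get(candidate_id, set())
        score = len(related & context)
        if score > best_score:
            best_id, best_score = candidate_id, score
    # None means "unresolved": the mention would be kept as an unknown entity.
    return best_id if best_score >= min_overlap else None

ids = ["loc:paris-fr", "loc:paris-tx"]
print(disambiguate(ids, "The Louvre in Paris sits on the Seine".split()))  # loc:paris-fr
print(disambiguate(ids, "Paris is nice".split()))                          # None
```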

STEP 11:  CONTENT ENRICHMENT  |  KNOWLEDGE-BASED CONTENT ENRICHMENT

Knowledge-based Content Enrichment

The resulting entities are enriched with information from our Knowledge Base, ranging from ontological relations and short abstracts to related images. All information may be augmented with customer-specific content.
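
Conceptually, enrichment amounts to attaching a Knowledge Base record to each resolved entity and then layering customer-specific fields on top. The record layout, field names and identifiers in this sketch are assumptions for illustration only.

```python
def enrich(entity_id, knowledge_base, customer_content=None):
    """Merge the knowledge base record with optional customer-specific fields."""
    record = dict(knowledge_base.get(entity_id, {}))          # copy, do not mutate the source
    record.update((customer_content or {}).get(entity_id, {}))  # customer fields take precedence
    record["id"] = entity_id
    return record

# Hypothetical data.
kb = {
    "loc:paris-fr": {
        "abstract": "Capital of France.",
        "relations": {"located_in": "loc:france"},
        "images": ["paris.jpg"],
    }
}
customer = {"loc:paris-fr": {"sales_region": "EMEA-West"}}

print(enrich("loc:paris-fr", kb, customer))
```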