A brief introduction to how we use syntactic and semantic analysis to extract relevant and useful information from large bodies of unstructured data, whether it is written in German, English, French or Italian.
Given the text snippet above, we will now illustrate the process of text mining and semantic analysis.
The text is first segmented into linguistic units such as words, numbers or punctuation.
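As a rough illustration of this segmentation step, the following minimal sketch splits text into numbers, words and punctuation marks with a single regular expression. The pattern and the function name `tokenize` are assumptions made for this example; a production tokenizer handles many more cases (abbreviations, hyphenation, language-specific rules).

```python
import re

# Hypothetical token pattern, tried in this order:
#   1. numbers, optionally with decimal or thousands separators
#   2. words (Unicode-aware, so umlauts and accents are covered)
#   3. any single character that is neither a word character nor whitespace
TOKEN_RE = re.compile(r"\d+(?:[.,]\d+)*|\w+|[^\w\s]")


def tokenize(text):
    """Split text into a flat list of number, word and punctuation tokens."""
    return TOKEN_RE.findall(text)


print(tokenize("We paid 1,250.50 euros, twice!"))
# ['We', 'paid', '1,250.50', 'euros', ',', 'twice', '!']
```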
We assign a word class (part-of-speech) to all words and other kinds of tokens in the text. Parts of speech include nouns (N), verbs (V), adjectives (ADJ), adverbs (ADV), conjunctions (CNJ), and so on.
Based on the previously assigned part-of-speech tags, we determine the lemma (base form) of each word. This lets us abstract away from inflected forms and resolve ambiguities.
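To show why the part-of-speech tag matters for lemmatization, here is a small sketch that looks up lemmas by the pair (word form, tag). The lexicon `LEMMAS` and the function `lemmatize` are hypothetical and only serve as an example; real lemmatizers combine large lexicons with morphological rules.

```python
# Hypothetical toy lexicon: (lowercased word form, POS tag) -> lemma.
# The same form can map to different lemmas depending on its tag.
LEMMAS = {
    ("saw", "V"): "see",
    ("saw", "N"): "saw",
    ("houses", "N"): "house",
    ("better", "ADJ"): "good",
}


def lemmatize(tagged_tokens):
    """Map (token, tag) pairs to lemmas.

    Pairs missing from the lexicon fall back to the lowercased token.
    """
    return [
        LEMMAS.get((token.lower(), tag), token.lower())
        for token, tag in tagged_tokens
    ]


print(lemmatize([("She", "PRON"), ("saw", "V"), ("a", "DET"), ("saw", "N")]))
# ['she', 'see', 'a', 'saw']
```

The verb "saw" is reduced to "see", while the noun "saw" keeps its form, which is the kind of ambiguity the tag helps to resolve.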
Using regular expressions, we are able to recognize units such as email addresses, URLs, date and time specifications, numbers or monetary expressions.
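A minimal sketch of such pattern-based recognition might look as follows. The patterns are deliberately simplified assumptions (ISO-style dates only, a handful of currency markers) and the names `PATTERNS` and `find_units` are invented for this example.

```python
import re

# Hypothetical, simplified patterns. Order matters: more specific unit
# types are listed before the generic NUMBER pattern.
PATTERNS = [
    ("EMAIL", r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    ("URL", r"https?://\S+"),
    ("DATE", r"\d{4}-\d{2}-\d{2}"),
    ("MONEY", r"(?:EUR|USD|CHF|[€$£])\s?\d+(?:[.,]\d+)*"),
    ("NUMBER", r"\d+(?:[.,]\d+)*"),
]

# One combined expression with a named group per unit type.
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in PATTERNS))


def find_units(text):
    """Return (unit type, matched text) pairs in order of appearance."""
    return [(match.lastgroup, match.group()) for match in MASTER.finditer(text)]


text = "Mail info@example.com by 2024-03-15 or see https://example.com/docs for CHF 1,200.50 (3 seats)."
for unit in find_units(text):
    print(unit)
```

Combining the patterns into one expression means each stretch of text is claimed by at most one unit type, so the amount in "CHF 1,200.50" is not reported a second time as a plain number.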
Not only do we recognize Multi-Word Expressions (MWEs) stored in our Knowledge Base, but we also employ machine learning methods to find significant MWEs dynamically from document sets.
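The learning method used for dynamic MWE discovery is not specified here. As a simple stand-in, the sketch below scores adjacent word pairs by pointwise mutual information (PMI), a common statistical association measure. The function name `find_mwes` and both thresholds are assumptions chosen for this example.

```python
import math
from collections import Counter


def find_mwes(docs, min_count=2, min_pmi=2.0):
    """Return ((word1, word2), pmi) for adjacent pairs, highest PMI first.

    docs is a list of token lists. min_count and min_pmi are hypothetical
    thresholds that filter out rare and weakly associated pairs.
    """
    unigrams, bigrams = Counter(), Counter()
    for tokens in docs:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    n_unigrams = sum(unigrams.values())
    n_bigrams = sum(bigrams.values())

    scored = []
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue
        p_pair = count / n_bigrams
        p_independent = (unigrams[w1] / n_unigrams) * (unigrams[w2] / n_unigrams)
        pmi = math.log2(p_pair / p_independent)
        if pmi >= min_pmi:
            scored.append(((w1, w2), pmi))
    return sorted(scored, key=lambda item: -item[1])


docs = [
    "the new york office is in the city".split(),
    "the office in new york is the biggest".split(),
    "the team in the office likes new york".split(),
]
print([pair for pair, _ in find_mwes(docs)])
# [('new', 'york')]
```

Pairs such as "in the" also occur repeatedly in this tiny sample, but their parts are frequent on their own, so their PMI stays below the threshold.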
Relevant keywords – including multi-word expressions – are identified using statistical analysis based on the whole text corpus (document collection).
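One widely used way to rank keywords against a whole collection is TF-IDF, shown in the sketch below. This is an assumption for illustration, not necessarily the statistic used in our pipeline, and the function name `top_keywords` is invented. Tokens may be single words or previously recognized multi-word expressions.

```python
import math
from collections import Counter


def top_keywords(docs, doc_index, k=3):
    """Rank the terms of docs[doc_index] by TF-IDF against all documents.

    docs is a list of token lists. Ties are broken alphabetically.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document

    target = docs[doc_index]
    term_freq = Counter(target)
    scores = {
        term: (count / len(target)) * math.log(n_docs / doc_freq[term])
        for term, count in term_freq.items()
    }
    return sorted(scores, key=lambda term: (-scores[term], term))[:k]


docs = [
    ["the", "parser", "reads", "the", "text"],
    ["the", "tagger", "labels", "the", "text"],
    ["the", "parser", "builds", "the", "parser", "tree"],
]
print(top_keywords(docs, 2, k=2))
# ['builds', 'tree']
```

A term like "the" appears in every document, so its inverse document frequency is zero and it drops to the bottom of the ranking no matter how often it occurs.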
Using machine learning techniques, we locate and classify named entities in text into classes such as persons, locations, organizations and companies.
Our Knowledge Base provides over 4.3 million locations, 3.3 million persons and 520,000 companies and organizations, which serve as disambiguation candidates for the previously identified named entity mentions.
Ambiguities are resolved with high precision using ontological relations and context information from our Knowledge Base. The remaining entity mentions are clustered as unknown entities.
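The idea of choosing among candidates by context can be sketched as follows. The miniature knowledge base `KB`, its entity ids and the overlap scoring in `disambiguate` are all hypothetical simplifications; the actual system also draws on ontological relations, which this example leaves out.

```python
# Hypothetical miniature knowledge base: lowercased surface form ->
# candidate entities, each with an id and a set of typical context terms.
KB = {
    "paris": [
        {"id": "Paris_(France)", "context": {"france", "seine", "louvre", "capital"}},
        {"id": "Paris_(Texas)", "context": {"texas", "usa", "lamar", "county"}},
    ],
}


def disambiguate(mention, context_tokens, kb=KB, min_overlap=1):
    """Return the id of the best-matching candidate.

    Candidates are scored by how many of their context terms occur near
    the mention. None marks an unknown entity: either the mention has no
    candidates or no candidate reaches min_overlap.
    """
    context = {token.lower() for token in context_tokens}
    best_id, best_score = None, 0
    for candidate in kb.get(mention.lower(), []):
        score = len(candidate["context"] & context)
        if score > best_score:
            best_id, best_score = candidate["id"], score
    return best_id if best_score >= min_overlap else None


print(disambiguate("Paris", "The Louvre in Paris lies on the Seine".split()))
# Paris_(France)
print(disambiguate("Springfield", "Springfield has a new mayor".split()))
# None
```

Mentions that come back as `None` correspond to the unknown entities mentioned above, which would then be clustered in a separate step.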
Resulting entities are enriched with information from our Knowledge Base, ranging from ontological relations and short abstracts to related images. All information may be augmented with customer-specific content.