Text categorisation that works linguistically by classifying verbs, nouns, adjectives… in context, not just isolated keywords
The categorisation service is built using our NaturalExtractor technology. It classifies text into different categories according to a predefined taxonomy.
A typical example of categorisation, taken from consumer electronics, would be:
- The concepts “screen”, “case”, “cover”, “camera” and “battery” all belong to the PRODUCT category, but only when they are used as nouns. Sentences like “I love the screen on my new Kindle Fire” or “I’ve bought a great new cover for my iPad” would be classified as belonging to the PRODUCT category, but sentences like “I hate it when they screen my iPad at security” or “I hope they’re going to cover the new Galaxy Tab in next week’s review” would not.
For a reliable categorisation process, our service first uses Deep Linguistic Analysis to detect entities, concepts and verbs (e.g. “Barack Obama”, “global warming”, “increase in prices”, “took off”). The linguistic representation of the text is then checked against a dictionary that stores the taxonomy. When a word or phrase in the text corresponds to a dictionary entry, the category for that entry is assigned to the text.
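The dictionary lookup described above can be sketched in a few lines. This is a minimal illustration and not the NaturalExtractor implementation: the `Token` type, the `TAXONOMY` table and the `categorise` function are all hypothetical names, and the tokens are tagged by hand to stand in for the output of the linguistic analysis step. The point it shows is that keying the dictionary on both the base form and the part of speech lets “screen” the noun match while “screen” the verb does not.

```python
from typing import NamedTuple


class Token(NamedTuple):
    text: str   # the word as it appears in the sentence
    lemma: str  # its base form
    pos: str    # coarse part of speech, e.g. "NOUN" or "VERB"


# Hypothetical taxonomy dictionary: (lemma, part of speech) -> category.
TAXONOMY = {
    ("screen", "NOUN"): "PRODUCT",
    ("case", "NOUN"): "PRODUCT",
    ("cover", "NOUN"): "PRODUCT",
    ("camera", "NOUN"): "PRODUCT",
    ("battery", "NOUN"): "PRODUCT",
}


def categorise(tokens):
    """Return the categories of every dictionary entry found in the tokens."""
    return {
        TAXONOMY[(t.lemma, t.pos)]
        for t in tokens
        if (t.lemma, t.pos) in TAXONOMY
    }


# Hand-tagged fragments (an assumption; a real analyser would produce these).
noun_use = [
    Token("I", "I", "PRON"),
    Token("love", "love", "VERB"),
    Token("the", "the", "DET"),
    Token("screen", "screen", "NOUN"),
]
verb_use = [
    Token("they", "they", "PRON"),
    Token("screen", "screen", "VERB"),
    Token("my", "my", "DET"),
    Token("iPad", "iPad", "PROPN"),
]

print(categorise(noun_use))  # {'PRODUCT'}
print(categorise(verb_use))  # set()
```

A plain keyword match on “screen” would put both fragments in the PRODUCT category; the part-of-speech key is what separates them.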
We can also help you bootstrap your dictionary creation process for any domain
This process is based on the meaning of the words used and does not rely on simple keyword matching. Linguistic variations that change the form of a word but do not alter its core meaning are handled correctly. This includes morphological variation (different forms of a verb according to mood, tense, gender, number and person) and syntactic phenomena such as phrasal verbs: “Apple takes over Italian software start-up” contains the phrasal verb “take over”, whereas “Apple takes lead over Google” contains the same two words but not the phrasal verb.
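As a rough illustration of the two phenomena, the sketch below maps inflected verb forms to a base form and then looks for a verb followed directly by its particle. Everything in it is an assumption made for the example: the `LEMMAS` table, the `PHRASAL_VERBS` dictionary, the ACQUISITION category and the adjacency rule are hypothetical, and a real linguistic analyser would use morphological and syntactic analysis instead of a hand-written table and a word-order heuristic.

```python
# Hypothetical lemma table; a real analyser derives base forms from morphology.
LEMMAS = {"takes": "take", "took": "take", "taken": "take", "taking": "take"}

# Hypothetical phrasal-verb dictionary: (verb lemma, particle) -> category.
PHRASAL_VERBS = {("take", "over"): "ACQUISITION"}


def phrasal_categories(sentence):
    """Return categories of phrasal verbs whose particle directly follows the verb."""
    words = sentence.lower().split()
    found = set()
    # Walk over adjacent word pairs: (words[0], words[1]), (words[1], words[2]), ...
    for verb, particle in zip(words, words[1:]):
        key = (LEMMAS.get(verb, verb), particle)
        if key in PHRASAL_VERBS:
            found.add(PHRASAL_VERBS[key])
    return found


print(phrasal_categories("Apple takes over Italian software start-up"))  # {'ACQUISITION'}
print(phrasal_categories("Apple took over Italian software start-up"))   # {'ACQUISITION'}
print(phrasal_categories("Apple takes lead over Google"))                # set()
```

The adjacency rule is deliberately crude: it would miss a separated particle such as “takes the company over”, which is one reason a syntactic analysis is preferable to pattern matching on word order.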
The categorisation service works with a user-supplied taxonomy, but often there is no pre-existing dictionary or thesaurus of categories that can be easily integrated. In that case we have a simple solution for dramatically reducing the time and cost of creating one. Our concept and entity extraction services can be used to analyse documents belonging to the target domain in order to bootstrap the taxonomy-building process. By extracting the most relevant concepts, entities and verbs from a corpus of documents, the effort of assigning words to categories can be significantly reduced.
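One way to picture the bootstrapping step is as a ranking of extracted terms, so that whoever builds the taxonomy starts from a short candidate list instead of a blank page. The sketch below is only an assumption about how such a ranking might look: `candidate_terms` is a hypothetical helper, the input lists stand in for the output of the extraction services, and document frequency is used here as a simple stand-in for relevance.

```python
from collections import Counter


def candidate_terms(extracted_per_document, top_n=3):
    """Rank extracted terms by the number of documents they occur in."""
    doc_freq = Counter()
    for terms in extracted_per_document:
        doc_freq.update(set(terms))  # count each term once per document
    # Sort by descending frequency, then alphabetically so ties are stable.
    ranked = sorted(doc_freq.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]


# Hypothetical extraction output for three documents from the target domain.
corpus_terms = [
    ["screen", "battery", "camera"],
    ["battery", "screen", "cover"],
    ["battery", "delivery"],
]

print(candidate_terms(corpus_terms))
# [('battery', 3), ('screen', 2), ('camera', 1)]
```

The ranked terms would still need to be assigned to categories by hand; the saving comes from not having to discover the vocabulary of the domain manually.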