Good morning/afternoon/evening,
I work for an insurance company and I'm running into the following problem:
I'm trying to standardise 'free-text clauses' in our insurance system, clauses for insurance products which, for one reason or another, were not entered using a standard clause.
The difficulty here is that, for over 20 years now, people have been using (slightly) different text to describe similar problems, or (slightly) similar text to describe different problems.
I want to analyse my list of cells containing text (some 17,000 rows).
I aim to find out:
- how to group the texts by similarity, and how to adjust the definition of 'similarity', e.g. 60% similar, 70% similar etc.
- how to ascertain percentages of similar text, e.g. 120 groups of similar text, group one comprising 4% of the data, group 2 comprising 6% etc.
- how to group the data based on key words, e.g. group all text clauses which contain words 'X', 'Y', 'Z'.
- how to exclude certain words or phrases, like 'and' or 'the client has indicated', from the formulas used for the above, so that they are ignored when calculating similarity.
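To make the four goals above concrete, here is a rough sketch in Python using only the standard library. It is one possible approach, not a recommendation: similarity is taken to be the word-level `difflib.SequenceMatcher` ratio, grouping is a simple greedy pass that compares each text to the first member of every existing group, and all names (`STOP_PHRASES`, `STOP_WORDS`, `normalise`, `group_by_similarity`, `group_shares`, `with_keywords`) as well as the sample clauses are invented for illustration.

```python
import re
from difflib import SequenceMatcher

# Hypothetical ignore lists; extend them to suit the real clauses.
STOP_PHRASES = ["the client has indicated"]
STOP_WORDS = {"and", "the", "a", "of"}

def normalise(text):
    """Lowercase, drop stop phrases, split into words, drop stop words."""
    text = text.lower()
    for phrase in STOP_PHRASES:
        text = text.replace(phrase, " ")
    return [w for w in re.findall(r"[a-z0-9]+", text) if w not in STOP_WORDS]

def similarity(words_a, words_b):
    """Word-level similarity between 0.0 and 1.0."""
    return SequenceMatcher(None, words_a, words_b).ratio()

def group_by_similarity(texts, threshold=0.7):
    """Greedy grouping: a text joins the first group whose representative
    (its first member) is at least `threshold` similar, else starts a new one."""
    groups = []  # list of (representative_words, [row indices])
    for i, text in enumerate(texts):
        words = normalise(text)
        for rep, members in groups:
            if similarity(rep, words) >= threshold:
                members.append(i)
                break
        else:
            groups.append((words, [i]))
    return [members for _, members in groups]

def group_shares(groups, total):
    """Size of each group as a percentage of all rows."""
    return [round(100 * len(g) / total, 1) for g in groups]

def with_keywords(texts, keywords):
    """Indices of texts that contain every keyword as a whole word."""
    wanted = {k.lower() for k in keywords}
    return [i for i, t in enumerate(texts)
            if wanted <= set(re.findall(r"[a-z0-9]+", t.lower()))]

# Invented sample clauses, not real data.
clauses = [
    "The client has indicated that flood damage is excluded.",
    "Flood damage is excluded",
    "Theft of bicycles is covered up to 500 euro.",
    "Theft of bicycles covered up to 500 euro",
]
groups = group_by_similarity(clauses, threshold=0.7)
shares = group_shares(groups, len(clauses))
flood_rows = with_keywords(clauses, ["flood", "excluded"])
```

Raising or lowering `threshold` is how the 60% or 70% definition of 'similar' would be adjusted in this sketch. The greedy pass depends on row order and compares only against each group's first member, so a proper clustering method may suit 17,000 rows better.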
Due to the sensitive nature of the data I cannot post any examples of the data I am working with.
Any and all tips will be greatly appreciated, thanks in advance.