Classifiers assembled with identical training sets using IBM Watson NLU and IBM Watson NLC services yield different results
Everyone actively using the Natural Language Classifier service from IBM Watson has seen the following message while using the API:
"On 9 August 2021, IBM announced the deprecation of the Natural Language Classifier service. The service will no longer be available from 8 August 2022. As of 9 September 2021, you will not be able to create new instances. Existing instances will be supported until 8 August 2022. Any instance that still exists on that date will be deleted. For more information, see IBM Cloud Docs"
IBM actively encourages users to migrate NLC models to its Natural Language Understanding (NLU) service. Today I migrated my first classification model from Natural Language Classifier to Natural Language Understanding. Since I have not looked into the technological background of either service, I wanted to compare the output of both. To do so, I followed the migration guidelines provided by IBM ( NLC --> NLU migration guidelines ). To recreate the NLC classifier in NLU, I downloaded the complete set of training data used to create the original classifier in the NLC service, so the data sets used to train the NLC and NLU classifiers are identical. Recreating the classifier in NLU was straightforward, and training took about the same time as in NLC.
To compare the performance, I then assembled a test set of phrases that was not used for training in either the NLC or the NLU service. The test set contains 100 phrases, each of which was passed through both the NLC and the NLU classifier. To my big surprise, the differences are substantial: 18 of the 100 results differ by more than 0.30 in confidence value, and 37 of the 100 differ by more than 0.2.
In my opinion, this difference is too large to migrate all NLC models to NLU without hesitation. The results so far justify further investigation, including a manual curation step by an SME to validate the analysis results. I am not too happy about this. Have other users seen this issue or made the same observation? Perhaps someone can shed light on why the NLC and NLU services give different results, and on how to close the gap between them.
Below is an excerpt of the comparison results:
Title | NLC confidence | NLU confidence | Comparability |
---|---|---|---|
"Microbial Volatile Organic Compound (VOC)-Driven Dissolution and Surface Modification of Phosphorus-Containing Soil Minerals for Plant Nutrition: An Indirect Route for VOC-Based Plant-Microbe Communications" | 0,01 | 0,05 | comparable |
"Valorization of kiwi agricultural waste and industry by-products by recovering bioactive compounds and applications as food additives: A circular economy model" | 0,01 | 0,05 | comparable |
"Quantitatively unravelling the effect of altitude of cultivation on the volatiles fingerprint of wheat by a chemometric approach" | 0,70 | 0,39 | different |
"Identification of volatile biomarkers for high-throughput sensing of soft rot and Pythium leak diseases in stored potatoes" | 0,01 | 0,33 | different |
"Impact of Electrolyzed Water on the Microbial Spoilage Profile of Piedmontese Steak Tartare" | 0,08 | 0,50 | different |
"Review on factors affecting Coffee Volatiles: From Seed to Cup" | 0,67 | 0,90 | different |
"Chemometric analysis of the volatile profile in peduncles of cashew clones and its correlation with sensory attributes" | 0,79 | 0,98 | comparable |
"Surface-enhanced Raman scattering sensors for biomedical and molecular detection applications in space" | 0,00 | 0,00 | comparable |
"Understanding the flavor signature of the rice grown in different regions of China via metabolite profiling" | 0,26 | 0,70 | different |
"Nutritional composition, antioxidant activity, volatile compounds, and stability properties of sweet potato residues fermented with selected lactic acid bacteria and bifidobacteria" | 0,77 | 0,87 | comparable |
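For reference, the threshold counting described above can be sketched in a few lines of Python. This is only a minimal illustration on the ten excerpt rows, not the full 100-phrase test set; the names `EXCERPT`, `parse_confidence` and `count_different` are hypothetical helpers, and the confidence values are assumed to be decimal-comma strings as shown in the table.

```python
# (NLC, NLU) confidence pairs copied from the ten excerpt rows above,
# kept as decimal-comma strings exactly as they appear in the table.
EXCERPT = [
    ("0,01", "0,05"),
    ("0,01", "0,05"),
    ("0,70", "0,39"),
    ("0,01", "0,33"),
    ("0,08", "0,50"),
    ("0,67", "0,90"),
    ("0,79", "0,98"),
    ("0,00", "0,00"),
    ("0,26", "0,70"),
    ("0,77", "0,87"),
]


def parse_confidence(value: str) -> float:
    """Convert a decimal-comma string such as '0,70' to a float."""
    return float(value.replace(",", "."))


def count_different(pairs, threshold: float) -> int:
    """Count pairs whose absolute confidence difference exceeds the threshold."""
    return sum(
        1
        for nlc, nlu in pairs
        if abs(parse_confidence(nlc) - parse_confidence(nlu)) > threshold
    )


# Report the number of 'different' rows for both thresholds used in the question.
for threshold in (0.2, 0.3):
    print(threshold, count_different(EXCERPT, threshold))
```

On these ten rows, five pairs differ by more than 0.2 and four by more than 0.3; the "different" labels in the excerpt correspond to the 0.2 threshold.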
Solution 1:[1]
We have also been migrating our classifiers from NLC to NLU and analyzing the differences. We explored several factors that might have an influence, such as upper case versus lower case and text length, but found no correlation in these cases.
We did, however, find some correlation between the NLU score gap between the first and second returned classes and the score drop relative to NLC. That is, the closer the score of the second class is to the first, the lower the NLU score on the first class. We call this confusion. In our data there are times when the confusion is ‘real’ (i.e. an SME would also classify the test phrase as borderline between two classes), but there were also times when we realized we could improve our training data to make the classes more ‘distinct’.
Bottom line: we cannot explain the NLU internals that generate the difference, and we still see a drop in scores between NLC and NLU, but it is across the board. We will move ahead with NLU despite the lower scores, since this does not hinder our interpretation of results.
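A minimal sketch of how such a check could be run, assuming the per-class confidences returned by NLU are available as a dictionary for each test phrase. The helpers `top2_margin` and `pearson` are hypothetical names, and the sample values below are made-up placeholders for illustration only, not our actual data.

```python
from math import sqrt


def top2_margin(class_scores: dict) -> float:
    """Gap between the best and second-best class confidence."""
    ranked = sorted(class_scores.values(), reverse=True)
    if len(ranked) < 2:
        raise ValueError("need at least two classes to compute a margin")
    return ranked[0] - ranked[1]


def pearson(xs, ys) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Placeholder values (NOT real results): for each test phrase, the NLC top
# score and the NLU confidences per class.
samples = [
    (0.90, {"class_a": 0.85, "class_b": 0.10, "class_c": 0.05}),
    (0.80, {"class_a": 0.50, "class_b": 0.40, "class_c": 0.10}),
    (0.75, {"class_a": 0.45, "class_b": 0.42, "class_c": 0.13}),
]

margins = [top2_margin(nlu) for _, nlu in samples]
drops = [nlc - max(nlu.values()) for nlc, nlu in samples]

# A clearly negative coefficient would match the pattern described above:
# a small gap between the top two classes goes with a large score drop.
print(pearson(margins, drops))
```

Phrases with a small margin are the ones worth showing to an SME, to decide whether the confusion is real or points at overlapping training classes.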
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | Pmag |