AI-SDV 2020: Bringing AI to SME projects: Addressing customer needs with a flexible set of tools and services Holger Keibel (Karakun, Switzerland) Elisabeth Maier (Karakun, Switzerland)
Customers interested in Language Analytics solutions typically approach us with a broad range of business cases and specific business needs. Especially when it comes to the data available for their case and for any AI aspects involved, the variation in data types, data quality and data quantity is, in our experience, vast, and at the same time so critical for a project's success that we often start our requirements analysis right there: at the data. At Karakun, our Language Analytics team addresses this in an increasingly flexible way: we select from a set of Language Analytics tools and related services (e.g. data cleansing and data procurement) to meet the business needs at hand with the data available, or at least within reach, at reasonable costs.
The methodology stack ranges from heuristic logic through statistical solutions to neural networks. At the same time, we aim to reduce the amount of data needed for training, e.g. by integrating state-of-the-art neural technologies into our platform. In this way, SMEs and their specific business cases can also benefit from the full range of Language Analytics options.
To illustrate our approach, we will present an e-Safe solution which allows for semantic document tagging and search in highly secured virtual safes. In addition, our solution provides text-based triggers for complex workflows depending on the safe's content.
1. www.karakun.com
Bringing AI to SME projects:
Addressing customer needs with a
flexible set of tools and services
Holger Keibel
Elisabeth Maier
AI-SDV 2020
2. 2
Background
• Karakun AG (Basel, 50 employees)
• Builds custom software where no standard
solution exists on the market
• Uses open-source components where possible
• Offers software platforms
to boost development efficiency, e.g.
HIBU platform offering pre-built functionalities
for solutions around Enterprise Search,
Language Analytics, and AI
3. 3
Our customers’ most frequent AI needs
Text classification
• Assign categories
to texts
• Predefined set of
categories
Information extraction
• Identify within a text
relevant pieces of
information
• Entities, keywords,
values etc.
Topic identification
• Assign a label to a
text, summarizing its
main topic
• Generally use terms
found in the text
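The topic-identification task above can be illustrated with a minimal sketch in plain Python: label a text with its most salient term, here simply the most frequent non-stopword (a real system would use e.g. TF-IDF or a proper keyword extractor; the stopword list and sample text are illustrative assumptions).

```python
from collections import Counter

# Toy stopword list for the illustration only
STOPWORDS = {"the", "a", "of", "and", "to", "is", "in", "for"}

def topic_label(text):
    """Return the most frequent non-stopword as a crude topic label."""
    words = [w.strip(".,").lower() for w in text.split()]
    counts = Counter(w for w in words if w and w not in STOPWORDS)
    return counts.most_common(1)[0][0]

print(topic_label("The invoice lists the invoice number and the invoice total."))
# invoice
```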
5. 5
Custom classifiers & extractors
• Fine-tune built-in classifier/extractor to customer’s domain
• Extend built-in classifier/extractor by additional categories/information types
• Create new classifier/extractor for custom set of categories/information types
• Assign editorial content to newsletters
• E-mail triage
• Recognize tax-relevant documents / specific contract types / ..
• Recognize country-specific payment slips and extract relevant data
• …
6. 6
(Supervised) Learning
• Statistical: SVMs, Naive Bayes,
decision trees
• Neural networks (deep learning)
AI choices for custom classifiers/extractors
Rule-based
• Regular expressions
• Ontologies / terminologies
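The rule-based option can be sketched with two regular expressions, e.g. for dates and currency amounts as they might appear on an invoice (the patterns and the sample text are illustrative assumptions, not production rules):

```python
import re

# Illustrative rule-based extractors for two information types
DATE_RE = re.compile(r"\b(\d{2})\.(\d{2})\.(\d{4})\b")            # e.g. 31.12.2020
AMOUNT_RE = re.compile(r"\b(?:CHF|EUR)\s?(\d+(?:[.,]\d{2})?)\b")  # e.g. CHF 120.50

def extract_invoice_fields(text):
    """Return all dates and currency amounts found in the text."""
    dates = [".".join(m) for m in DATE_RE.findall(text)]
    amounts = AMOUNT_RE.findall(text)
    return {"dates": dates, "amounts": amounts}

sample = "Invoice of 15.03.2020, total CHF 120.50, due by 14.04.2020."
print(extract_invoice_fields(sample))
# {'dates': ['15.03.2020', '14.04.2020'], 'amounts': ['120.50']}
```

Such rules need no training data at all, which is the flip side of the cost profile shown on the next slide.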
7. 7
Cost factors and quality aspects
                                      Rule-based    Supervised learning
Required data volume                  low           high
Required data quality                 rather low    high
Initial ramp-up costs                 rather high   rather high
Maintenance costs                     high          moderate
Costs of scaling system to new
  domains, applications and
  languages (→ time to market)        high          moderate
Sensitive to context                  low           high
Recall (→ false negatives)            low (1)       high
8. 8
Training data for supervised learning
Rule of thumb (until recently):
To train a document classifier with N target categories,
one needs on the order of 10,000 × N training documents.
→ For SMEs, suitable and sufficient training data are …
• In general: not readily available
• Costly to procure
• Investments generally don’t pay off for SMEs’ business cases
9. 9
Examples from previous projects
Classification task Training data
Assign editorial content to
newsletters (finance)
Large number readily available:
all articles from past newsletters
Extract key data from invoices
None available;
Generation of synthetic data not suitable here
Detect whether a message
talks about adverse effects of a
medication
Hardly any existed;
Collected some by web search (medications & known adverse effects);
But highly biased: missing unknown adverse effects
10. 10
Customer project by DSwiss:
Encrypted digital safes
• Users can upload any type of document
• Classifier and extractors used for
search filters
• Frequently need to extend to
new categories and languages
• But:
• Classifier is rule-based
• Difficult to obtain large amount of
suitable training data
11. 11
Our approach in previous projects
• Assess classification/extraction task
• Inspect relevant data that are readily available
• Do our built-in classifiers/extractors suffice?
• If new classifier/extractor is needed, consider all approaches:
Rule-based:
Sometimes the best
choice
Statistical:
Often good choice if
decent amount of
training data available
and features can be
engineered efficiently
Neural:
In practice rarely used
in specific customer
projects – not enough
data to get advantage
over statistical
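The statistical option can be sketched with a minimal bag-of-words Naive Bayes classifier in plain Python (toy training data for illustration; in practice one would use a library such as scikit-learn):

```python
import math
from collections import Counter, defaultdict

class NaiveBayesTextClassifier:
    """Minimal multinomial Naive Bayes over bag-of-words features."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)   # label -> word frequencies
        self.label_counts = Counter(labels)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        words = text.lower().split()
        total_docs = sum(self.label_counts.values())
        best_label, best_score = None, float("-inf")
        for label in self.label_counts:
            # log prior + log likelihood with add-one (Laplace) smoothing
            score = math.log(self.label_counts[label] / total_docs)
            n_label = sum(self.word_counts[label].values())
            for w in words:
                score += math.log((self.word_counts[label][w] + 1)
                                  / (n_label + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

clf = NaiveBayesTextClassifier().fit(
    ["invoice total amount due", "payment slip amount",
     "contract terms and conditions", "rental contract signed"],
    ["invoice", "invoice", "contract", "contract"],
)
print(clf.predict("amount due on this invoice"))   # invoice
```

The catch, as noted above, is that a decent amount of labelled training data is needed before such a model beats hand-written rules.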
12. 12
A true game-changer:
Pre-trained language models
• Based on the Transformer architecture (e.g. BERT, GPT-2/GPT-3)
• Pre-training model with prediction tasks
• On massive data
• Using only plain text → self-supervised learning
• Build up rich contextualized representations of words
vs. non-contextual word embeddings (word2vec and GloVe)
• Fine-tuning model to a target task
• Transfer learning
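The transfer-learning idea can be illustrated in plain Python with a hand-made stand-in for the pre-trained encoder: its word vectors are frozen, and only a new task-specific output layer is trained on a handful of labelled samples (all names, vectors and samples here are toy assumptions; in practice one would fine-tune a real model such as BERT via a library like Hugging Face Transformers):

```python
import math

# Hypothetical "pre-trained" word vectors standing in for a frozen encoder
PRETRAINED_VECTORS = {
    "invoice": [1.0, 0.1], "amount": [0.9, 0.2], "due": [0.8, 0.1],
    "contract": [0.1, 1.0], "terms": [0.2, 0.9], "signed": [0.1, 0.8],
}

def encode(text):
    """Frozen encoder: average the pre-trained word vectors (zeros if unknown)."""
    vecs = [PRETRAINED_VECTORS.get(w, [0.0, 0.0]) for w in text.lower().split()]
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(2)]

def train_output_layer(samples, epochs=200, lr=0.5):
    """Train only a logistic-regression head on top of the frozen encoder."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for text, y in samples:                # y = 1 for "invoice", 0 otherwise
            x = encode(text)
            p = 1 / (1 + math.exp(-(w[0]*x[0] + w[1]*x[1] + b)))
            err = p - y                        # gradient of the log loss
            w = [w[i] - lr * err * x[i] for i in range(2)]
            b -= lr * err
    return w, b

samples = [("invoice amount due", 1), ("contract terms signed", 0)]
w, b = train_output_layer(samples)

def predict(text):
    x = encode(text)
    return 1 if w[0]*x[0] + w[1]*x[1] + b > 0 else 0

print(predict("amount due"))        # 1 (invoice-like)
print(predict("signed contract"))   # 0 (contract-like)
```

Because the encoder already carries useful structure, two labelled samples suffice here; that is the mechanism behind the curve on the next slide.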
14. 14
Fine-tuning: much less data needed
[Figure: performance after training vs. number of training samples (log scale). Fine-tuning a pre-trained system reaches an acceptable performance level with far fewer training samples than training a network from scratch.]
15. 15
Advantages of using BERT
• Re-use architecture and trained model
• Only need to replace the output layer with a task-specific layer
• Significantly fewer training data needed
→ For SMEs, suitable and sufficient training data are …
• In many cases feasible to procure
• Investments do pay off for SMEs’ business cases
16. 16
Cost factors and quality aspects
                                      Rule-based    Supervised learning   Pre-trained
Required data volume                  low           high                  moderate
Required data quality                 rather low    high                  moderate
Initial ramp-up costs                 rather high   rather high           rather high ??
Maintenance costs                     high          moderate              moderate
Costs of scaling system to new
  domains, applications and
  languages (→ time to market)        high          moderate              moderate
Sensitive to context                  low           high                  high
Recall (→ false negatives)            low (1)       high                  high
17. 17
Joint research project
• Partners: SUPSI (Lugano) and DSwiss (Zurich)
• Co-funded by Innosuisse
Goals:
• Create core classifiers and extractors by fine-tuning BERT
• Increase coverage of document types
• Improve performance
• Extend to new tasks, e.g.
• Extract data from invoices
• Extract data from ID cards