On Thursday, November 10, Joe Hilger and Sara Duane spoke at Text Analytics Forum about identifying secure and confidential information using auto-tagging. Information security continues to grow in importance. We hear stories all the time about hackers accessing private information from companies and government agencies, and every organization struggles with employees who store confidential information on insecure network drives or cloud drives. Joe and Sara did a project with a federal research organization that used auto-tagging and text analytics to identify confidential information that needed to be moved to a secure location. During the presentation, they shared the approach they took to identify this information and how they ensured that the tagging and text analytics were accurate. Attendees learned best practices for designing a taxonomy for auto-tagging and tuning auto-tagging, as well as ways to identify confidential information across the enterprise.
Identifying Security Risks Using Auto-Tagging and Text Analytics
1. Identifying Security Risks Using
Auto-Tagging & Text Analytics
Text Analytics Forum 2022
Joe Hilger and Sara Duane
2. ENTERPRISE KNOWLEDGE
Outline
● EK at a Glance
● The Problem
● Our Approach
● Our Methodology and Best Practices
What You Will Learn
⬢ How to identify confidential information across an enterprise
⬢ Best practices for leveraging and tuning auto-tagging
⬢ How to design a taxonomy for auto-tagging
3. JOE HILGER
COO AND COFOUNDER, ENTERPRISE KNOWLEDGE
⬢ 33 Years of Consulting Experience
⬢ Expert in Knowledge Management and Knowledge Graph Technologies
⬢ Coauthor of Making KM Clickable (2022)

SARA DUANE
SENIOR TECHNICAL ANALYST, ENTERPRISE KNOWLEDGE
⬢ Serves as project manager for technical implementation and strategy projects
⬢ Conducted complex auto-tagging projects for clients in both the commercial and federal space
4. 10 AREAS OF EXPERTISE
● KM STRATEGY & DESIGN
● TAXONOMY & ONTOLOGY DESIGN
● TECHNOLOGY SOLUTIONS
● AGILE, DESIGN THINKING, & FACILITATION
● CONTENT & BRAND STRATEGY
● KNOWLEDGE GRAPHS, DATA MODELING, & AI
● ENTERPRISE SEARCH
● INTEGRATED CHANGE MANAGEMENT
● ENTERPRISE LEARNING
● CONTENT MANAGEMENT

80+ EXPERT CONSULTANTS
HEADQUARTERED IN WASHINGTON, DC, USA
ESTABLISHED 2013 – OUR FOUNDERS AND PRINCIPALS HAVE BEEN PROVIDING KNOWLEDGE MANAGEMENT CONSULTING TO GLOBAL CLIENTS FOR OVER 20 YEARS.
EK At A Glance: AWARD-WINNING CONSULTANCY
KMWORLD’S
● 100 COMPANIES THAT MATTER IN KM (2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022)
● TOP 50 TRAILBLAZERS IN AI (2020, 2021, 2022)
CIO REVIEW’S
● 20 MOST PROMISING KM SOLUTION PROVIDERS (2016)
INC MAGAZINE
● #2,343 OF THE 5000 FASTEST GROWING COMPANIES (2021)
● #2,574 OF THE 5000 FASTEST GROWING COMPANIES (2020)
● #2,411 OF THE 5000 FASTEST GROWING COMPANIES (2019)
● #1,289 OF THE 5000 FASTEST GROWING COMPANIES (2018)
● BEST WORKPLACES (2018, 2019, 2021, 2022)
WASHINGTONIAN MAGAZINE’S
● TOP 50 GREAT PLACES TO WORK (2017)
WASHINGTON BUSINESS JOURNAL’S
● BEST PLACES TO WORK (2017, 2018, 2019, 2020)
ARLINGTON ECONOMIC DEVELOPMENT’S
● FAST FOUR AWARD – FASTEST GROWING COMPANY (2016)
VIRGINIA CHAMBER OF COMMERCE’S
● FANTASTIC 50 AWARD – FASTEST GROWING COMPANY (2019, 2020)

PRESENCE IN BRUSSELS, BELGIUM
STABLE CLIENT BASE
6. Problem Statement
At this federal research organization, researchers, proposal authors, project managers, and others all leverage project content, data, and documentation on their shared drives.
They need a way to:
▪ Identify content that is controlled, CUI (Controlled Unclassified Information), or otherwise sensitive
So that they can:
▪ Move the relevant documents to a secure location
▪ Prevent data loss and compliance issues
▪ Ensure all documents have a classification
7. How Common Tools Solve the Problem
Many tools and solutions address this by looking for personally identifiable information (PII) through pattern recognition, including:
⬢ Using regex to identify the patterns behind PII, such as a phone number.
⬢ Identifying specific sensitivity labels within the content itself, such as “top secret.”
These products and solutions don’t look for terms or categories of information that reflect sensitive content. What if a piece of information within a document is sensitive, but contains neither the term “top secret” nor any PII identifiable through pattern recognition?
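The pattern-recognition approach above can be sketched in a few lines. This is a minimal illustration, not any specific product’s implementation; the patterns and the sample document are made up:

```python
import re

# Hypothetical patterns a conventional pattern-matching tool might use:
PII_PATTERNS = {
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "label": re.compile(r"\btop secret\b", re.IGNORECASE),
}

def flag_document(text: str) -> list[str]:
    """Return the names of the patterns that matched the text."""
    return [name for name, pattern in PII_PATTERNS.items() if pattern.search(text)]

doc = "Contact the PI at 555-867-5309 about the measurement results."
flag_document(doc)  # → ["phone"]
```

Note the limitation: a document describing a sensitive research topic with no phone number, SSN, or explicit label would sail through this check untouched.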
8. Our Solution: Teaching Technology
1. Identify the terms, words, and categories of information that suggest secure information.
2. Develop a subject-oriented topic taxonomy of secure terms.
3. Conduct auto-tagging on documents with this subject-oriented taxonomy to identify the secure content.
4. Leverage these tags and labels to begin the migration process.
9. What is a Taxonomy?
A taxonomy is a controlled vocabulary used to describe or characterize explicit concepts of information for the purpose of capturing, managing, and presenting.
Taxonomies are often driven by:
● Type of Content
● Medium
● Organization
● Purpose
● Topic (most relevant for our approach)
11. Building Our Understanding
For this engagement, EK conducted a thorough discovery phase:
● Focus Groups: Conduct focus groups with staff who are creators, holders, or consumers of content to ensure a complete understanding of the content they work with and what constitutes secure information for them.
● Document Review: Analyze documentation, content, and data that suggests secure information, as well as documentation without secure information, to identify key topics.
● Corpus Analysis: Conduct a semantic analysis of content that identifies significant terms through a machine learning algorithm and can validate and enhance the designed taxonomy.

33+ Focus Groups with Core Team & SMEs
287k Documents Evaluated
12. Building the Taxonomy
EK used the field of environmental research to model what could be identified as secure information within a specific domain. The taxonomies covered six areas:
● Study Area
● Geography
● Method of Measure
● Environment
● Application
● Content Type
The terms that made up these taxonomies were identified through focus groups with environmental research SMEs, as well as four corpus analyses on subsets of relevant content. The corpus analysis identified and added 37% of the taxonomy terms (i.e., terms and synonyms), thus enriching the final POC taxonomy.
13. Solution Architecture
EK leveraged two main tools for this POC:
o PoolParty: Hosted the taxonomy and ontology, and via API, auto-tagged the provided documents.
o GraphDB: Stored the documents and their applied tags from the taxonomy and ontology.
To successfully complete this approach, EK created data pipelines between the document storage account, PoolParty, and GraphDB using UnifiedViews, an ETL tool. These pipelines facilitated the necessary data transformation and integration to power GraphSearch.
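To make the storage step concrete: a triple store like GraphDB accepts SPARQL updates, so a pipeline can record a document’s suggested tags as triples. The sketch below is illustrative only, not the project’s actual UnifiedViews pipeline; the namespace and the `ex:hasTag` property are hypothetical:

```python
# Build a SPARQL INSERT DATA statement linking a document to its suggested tags.
def build_insert(doc_uri: str, tag_uris: list[str]) -> str:
    """Return a SPARQL update that asserts one ex:hasTag triple per tag."""
    triples = "\n".join(
        f"  <{doc_uri}> ex:hasTag <{tag}> ." for tag in tag_uris
    )
    return (
        "PREFIX ex: <http://example.org/schema#>\n"
        "INSERT DATA {\n" + triples + "\n}"
    )

query = build_insert(
    "http://example.org/doc/42",
    ["http://example.org/tag/geography", "http://example.org/tag/study-area"],
)
```

A middle-layer service would POST a statement like this to the store’s update endpoint after each tagging call.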
14. Visualizing Tags
⬢ EK leveraged PoolParty’s GraphSearch server to allow the organization to visualize the results of the auto-tagging process.
⬢ Users could filter and search for documents based on the identified tags.
⬢ During this phase, we could visualize and analyze the accuracy of the tags.
[Image: View of PoolParty’s GraphSearch]
17. Design Best Practices
Remember Your End User: A Machine
Design requirements for a machine are different from those for a taxonomy leveraged by a human for navigation, search, etc.
Granularity Is Important
The taxonomy should reflect the granularity of the content and get into the details of what is presented in the content.
Synonyms at the Correct Level Are Your Friends
With relevant and accurate synonyms used correctly, auto-tagging can better parse the text and recognize what the content is about.
Ensure Taxonomy Terms Are Reflective of the Content
The topics of your content items should help form the basis of your taxonomy.
19. WHAT IS AUTO-TAGGING?
Auto-tagging: An advanced application of taxonomy in which terms are automatically applied to content as tags through text recognition, inheritance, or other automated means.
Basic level: Searching the text for taxonomy terms to apply, relying solely on the term appearing in the content itself.
More complex level: Using context and machine learning to tag additional terms that may not be in the content itself.
What Type of Auto-tagging Works for Your Needs?
1. Metadata Inheritance
2. Migration Logic
3. NLP Extractor
4. ML Classification
5. Custom NER Models
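The “basic level” described above can be sketched as a term lookup over the text. This is a minimal, assumed illustration; the tiny taxonomy with its synonyms is invented, not the project’s actual vocabulary:

```python
import re

# Hypothetical mini-taxonomy: concept -> preferred label and synonyms.
TAXONOMY = {
    "Geography": ["geography", "watershed", "coastal zone"],
    "Method of Measure": ["spectrometry", "sampling protocol"],
}

def auto_tag(text: str) -> set[str]:
    """Return concepts whose preferred label or a synonym appears in the text."""
    tags = set()
    for concept, labels in TAXONOMY.items():
        for label in labels:
            # Word boundaries avoid matching inside longer words.
            if re.search(r"\b" + re.escape(label) + r"\b", text, re.IGNORECASE):
                tags.add(concept)
                break
    return tags

auto_tag("Samples from the coastal zone were analyzed via spectrometry.")
# → {"Geography", "Method of Measure"}
```

The “more complex level” replaces this literal lookup with models that infer concepts from context even when no label appears verbatim.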
20. AUTO-TAGGING WITH POOLPARTY EXTRACTION: Concept Extraction
Auto-tagging is text extraction with natural language processing (NLP) and light machine learning (corpus scoring) that scores extracted concepts by a mix of frequency, location in the document, etc.
It’s important to understand both the taxonomy and the content it will be used to tag. Auto-tagging will only tag well those fields of the taxonomy that are topical and well matched to the text of the content items.
Core Components Necessary for Auto-tagging:
● Synonym-rich taxonomy that is aligned with the target content
● Taxonomy management tool
● “Learning” corpus capabilities
● Content management system with target content
● Middle layer that can send content to be tagged and then store the suggested tags
21. Lemmatization and Stemming
Concept extraction does not require that the exact term from the taxonomy be present in the text. Techniques like stemming and lemmatization can help increase matches.
Lemmatization reduces words to their common base forms:
● am, are, is => be
● car, cars, car’s, cars’ => car
Stemming truncates a word to its root:
● accounts, accounting, accountant => account
Important Note! Stemming and lemmatization can be risky, as they may obscure real differences in meaning.
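A toy suffix-stripper shows the idea (and the risk) of stemming. This is a deliberately crude sketch, nowhere near a real stemmer such as Porter’s, and the suffix list is invented for the example:

```python
# Strip a few common suffixes so surface variants of a term collapse to one root.
SUFFIXES = ["ing", "ants", "ant", "s"]

def crude_stem(word: str) -> str:
    """Strip the first matching suffix; a real stemmer has many more rules."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Require a remaining root of at least 3 characters.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

[crude_stem(w) for w in ["accounts", "accounting", "accountant"]]
# → ["account", "account", "account"]
```

The risk is visible even here: a rule this blunt would also collapse genuinely different words (e.g., “sting” would lose its “ing”), which is why stemming must be tuned against the actual corpus.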
22. AUTO-TAGGING WITH POOLPARTY EXTRACTION: Scoring/Ranking Extraction
Scoring methods:
● Frequency: the more often a term appears in a document, the higher it scores.
● Location boosting: terms found in certain locations in a document (for example, the title) have their scores “boosted,” or weighted higher.
● Term Frequency–Inverse Document Frequency (TF-IDF): penalizes overly frequent terms and boosts rare terms. The frequency of a term in a document is balanced against the frequency of that term across a representative corpus of documents. For example, the most frequently used word in many English documents is “the”; under TF-IDF scoring, this term will have a low score.
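The TF-IDF behavior described above can be checked with a toy corpus, assuming the common `tf * log(N / df)` formulation (real tools typically use smoothed variants). The three sample documents are made up:

```python
import math

corpus = [
    "the sensor recorded the coastal data",
    "the report summarized the findings",
    "the coastal erosion study",
]

def tf_idf(term: str, doc: str, docs: list[str]) -> float:
    """Score a term in one document against a representative corpus."""
    tokens = doc.split()
    tf = tokens.count(term) / len(tokens)          # term frequency in this doc
    df = sum(1 for d in docs if term in d.split()) # documents containing the term
    if df == 0:
        return 0.0
    return tf * math.log(len(docs) / df)           # idf penalizes common terms

tf_idf("the", corpus[0], corpus)      # → 0.0 ("the" is in every document)
tf_idf("coastal", corpus[0], corpus)  # positive (appears in only 2 of 3 docs)
```

Because “the” occurs in every document, its inverse document frequency is log(1) = 0, so its score vanishes despite its high raw frequency, exactly the effect the slide describes.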
24. FINE-TUNING ITERATIVELY
Iterative Fine-tuning: you will need to conduct multiple rounds, tweaking the taxonomy and rules to best fit the content you are working with, and evaluating the accuracy for each round. The cycle repeats: Auto-tag, Evaluate Accuracy, Fine-tune.
Initial Fine-tuning:
● Blacklist
● Exact match
● Synonyms
● Adjust taxonomy
● Prioritize content segments (e.g., Title)
● Corpus scoring
Long-term Fine-tuning:
● Blacklist
● Exact match
● Disambiguation
● Ontology
● Shadow concepts
● Corpus adjustment
● TF-IDF scoring
● F-score
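One pass of the simpler fine-tuning rules can be sketched as a filter over the tagger’s suggestions. The rule sets and concept names below are hypothetical, not the project’s actual configuration:

```python
# Hypothetical tuning rules: concepts to drop outright, and concepts that are
# only trustworthy when their label appeared verbatim in the text.
BLACKLIST = {"General"}
EXACT_MATCH_ONLY = {"Environment"}

def fine_tune(suggested: dict[str, bool]) -> set[str]:
    """suggested maps concept -> whether its label appeared verbatim in the text."""
    kept = set()
    for concept, verbatim in suggested.items():
        if concept in BLACKLIST:
            continue  # blacklist rule: never apply this tag
        if concept in EXACT_MATCH_ONLY and not verbatim:
            continue  # exact-match rule: require a literal occurrence
        kept.add(concept)
    return kept

fine_tune({"General": True, "Environment": False, "Geography": True})
# → {"Geography"}
```

After each such pass, accuracy is re-evaluated and the rule sets and taxonomy are adjusted again, which is what makes the process iterative.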
25. HOW TO ASSESS ACCURACY
● GOLD STANDARD
● ANECDOTAL ACCURACY
● F-SCORES AND IAA (INTER-ANNOTATOR AGREEMENT)
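The F-score approach compares the tagger’s output against a human-built gold standard using the standard precision/recall definitions. The gold and predicted tag sets below are invented for illustration:

```python
def f1_score(gold: set[str], predicted: set[str]) -> float:
    """F1 of predicted tags against a gold-standard tag set."""
    tp = len(gold & predicted)          # true positives: tags both agree on
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)     # share of predicted tags that are correct
    recall = tp / len(gold)             # share of gold tags that were found
    return 2 * precision * recall / (precision + recall)

f1_score({"Geography", "Environment"}, {"Geography", "Content Type"})  # → 0.5
```

Inter-annotator agreement applies the same logic between two human taggers, establishing an upper bound on how well any automated tagger can be expected to score.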
26. Q&A
Thank you for listening. Questions?
JOE HILGER, COO and Co-Founder of Enterprise Knowledge
JHILGER@ENTERPRISE-KNOWLEDGE.COM
WWW.LINKEDIN.COM/IN/JOSEPH-HILGER/
SARA DUANE, Senior Technical Analyst
SDUANE@ENTERPRISE-KNOWLEDGE.COM
WWW.LINKEDIN.COM/IN/SARA-DUANE/