Changing the Way the Financial World
Processes & Utilizes Information
Copyright © 2017 CLEAR Information, Inc., a United States Class C corporation, all rights reserved.
Introduction
Speaker: Samuel Berger
Topic: Data architecture (i.e., normalization / relational algebra) and
database security
Description:
This presentation underscores the importance of creating precise data
structures when handling, processing and manipulating mass amounts
of data. As data has become key in the operations of virtually all major
companies around the world, having the data easily maintained and
utilized is pivotal. Companies often live or die in today’s hyper-
competitive business climate by their ability to advantageously
manipulate their data. It is therefore paramount that this enterprise-
critical data is housed in well-organized structures that are intuitive for
developers to work on. The bulk of this presentation offers tips and
examples on how to achieve this, along with the numerical benefits,
using a large data example.
Background
I am a Fintech entrepreneur and developer as well as a data scientist
having worked mostly in financial mass data projects since 1989. I
started in these fields with SBIC, using technology and massive
amounts of data to predict the world’s largest financial market –
FOREX. My systems earned my clients (Daiwa Securities, Bank of
Montreal, Julius Bär Group Ltd., Société Générale, royalty and
national treasuries, to name a few) returns of over 18% per annum
non-compounded over the 5 ½ years we traded. At peak my
company traded the equivalent of over $1 billion in a day. Some of
my other projects included: working on E*Trade’s E*Advisor
system, VeriSign, SGI, two industry-founding VOIP unified
messaging companies, and Enterprise Architect for Capital Group
Companies (managed over $1.3 trillion at the time). I am currently
working on a large project for CLEAR.
Discussion Points
I. Relational Algebra / Normalization Familiarity
II. Brief History
III. 1st Normal Form Simply Stated
IV. Best Practices
V. Practical Examples
VI. Key Structures
VII. Performance and Data Space / Maintenance
VIII. Overloaded Domains (Columns)
IX. Theory Modified Slightly by Practice
X. Locking
XI. Normalization Conclusion
XII. Securing the Data Layer
XIII. Problems with Outsourcing IT
XIV. Data Theft – Primary Weaknesses
XV. Q & A
Relational Algebra
Database Normalization
Have you worked on a database that was in at least
the First Normal Form (1NF)?
Does anyone know at what point in Normalization
duplicates are no longer allowed?
Does anyone know at what point NULLs are no
longer allowed?
Brief History
Relational algebra was primarily developed by Edgar F. Codd from 1969
to 1973, and primarily documented by C. J. (Chris) Date, both IBM employees.
Codd also created his “12 rules” (really 13, as he started from zero) that
were used to define the qualifications of a relational database management
system (RDBMS).
Codd’s work heavily influenced IBM’s first RDBMS, called System R,
begun in 1973. System R’s query language was designed by Raymond Boyce
and Donald Chamberlin. It introduced the Structured Query Language (SQL),
originally called SEQUEL while in development, hence the reason we still
refer to SQL Server as “Sequel” Server.
Codd and Boyce later teamed up to create the Boyce-Codd Normal Form,
which is one step stricter than the 3rd Normal Form.
1st Normal Form (1NF)
My goal here is not to confuse but to simplify.
Key principles:
1) Each row must have at least one unique key, also referred to as a
Candidate Key (i.e., no duplicate rows). A Candidate Key is the minimum grouping
of columns on a table that creates a unique record. Columns that do not help to
define uniqueness are attributes of the Candidate Key, or Candidate Keys should
more than one unique column set exist.
2) Every row/column intersection must have a value.
3) Every row/column intersection can contain only one value, not a list of
values.
4) Every row/column intersection must have a valid value from the pool of
potential valid values (i.e., a plane parts table cannot have a column for engine parts and
then enter into it both engine parts and plane max speeds).
5) The functionality of the table does not depend on the order of the data
with respect to rows or columns (i.e., querying the data determines the
column order and the row order of the output).
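As a quick illustration of principles 1 and 3, here is a sketch in Python's sqlite3 (the deck's own examples use SQL Server; the table and column names here are hypothetical). A multi-valued column is split into a child table so every row/column intersection holds exactly one value:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# 1NF violation: one cell holds a list of phone numbers.
cur.execute("CREATE TABLE PersonBad (PersonName TEXT, PhoneNumbers TEXT)")
cur.execute("INSERT INTO PersonBad VALUES ('Ada', '555-0100, 555-0101')")

# 1NF-compliant: one value per row/column intersection, with a
# Candidate Key (Person_ID, PhoneNumber) guaranteeing no duplicate rows.
cur.execute("CREATE TABLE Person (Person_ID INTEGER PRIMARY KEY, PersonName TEXT NOT NULL)")
cur.execute("""CREATE TABLE PersonPhone (
    PersonPhone_ID INTEGER PRIMARY KEY,
    Person_ID      INTEGER NOT NULL REFERENCES Person(Person_ID),
    PhoneNumber    TEXT NOT NULL,
    UNIQUE (Person_ID, PhoneNumber))""")
cur.execute("INSERT INTO Person (PersonName) VALUES ('Ada')")
cur.executemany("INSERT INTO PersonPhone (Person_ID, PhoneNumber) VALUES (1, ?)",
                [("555-0100",), ("555-0101",)])

# Each number is now individually queryable.
count = cur.execute("SELECT COUNT(*) FROM PersonPhone WHERE Person_ID = 1").fetchone()[0]
print(count)  # 2
```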
Best Practices
Modern database terminology uses the term Primary Key as a binder
of the data more than as a concept of a unique row identifier based on
data properties. As such, the Primary Key is now a separate concept
from the Primary Candidate Key. It should always be an auto-incrementing
integer starting sequentially from row 1, and the server prefers that it
be the first column. My naming convention is the table name plus
“_ID”.
Table and column names should always be descriptive, even if verbose.
Mistakes occur most commonly due to a lack of understanding of the
data model and the purpose of each container. Never use database
keywords for column names (e.g., name is a keyword, as is
filename).
Data should be related for the data’s sake and not for the current
application requirements. Requirements change; if the data is
structured accurately, then the data model will remain accurate.
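The surrogate-key naming convention above can be sketched as follows (sqlite3 used for a self-contained example; the table name is hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Primary Key convention: table name + "_ID", an auto-incrementing integer,
# declared as the first column. The descriptive "AccountName" also avoids
# using a bare keyword like "name" as a column name.
cur.execute("""CREATE TABLE Account (
    Account_ID  INTEGER PRIMARY KEY AUTOINCREMENT,
    AccountName TEXT NOT NULL UNIQUE)""")

cur.execute("INSERT INTO Account (AccountName) VALUES ('Operating')")
cur.execute("INSERT INTO Account (AccountName) VALUES ('Reserve')")

ids = [row[0] for row in cur.execute("SELECT Account_ID FROM Account ORDER BY Account_ID")]
print(ids)  # [1, 2]
```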
Audit Columns
Also a component of best practices is to include audit columns to cover
the following key pieces of data:
1) The date and time the record was created.
2) The person ID or process name that created the record.
3) The date and time the record was last updated.
4) The person ID or process name that made the update.
5) Update count – a truly critical column for every table
that will be covered a bit later in this discussion.
6) A column indicating whether the record is currently
active.
Note: In blockchain projects records are never updated, only added.
Also, additional information must be included to indicate whether or not
the data is in sync with its partners.
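The six audit columns above can be sketched on a sample table (sqlite3 for illustration; column names follow the list but are my choices, not a prescribed standard):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# The six audit columns: created date/by, updated date/by,
# UpdateCount, and an active-record flag.
cur.execute("""CREATE TABLE Client (
    Client_ID   INTEGER PRIMARY KEY,
    ClientName  TEXT NOT NULL,
    CreatedDate TEXT NOT NULL DEFAULT (datetime('now')),
    CreatedBy   TEXT NOT NULL,
    UpdatedDate TEXT,
    UpdatedBy   TEXT,
    UpdateCount INTEGER NOT NULL DEFAULT 0,
    IsActive    INTEGER NOT NULL DEFAULT 1)""")

cur.execute("INSERT INTO Client (ClientName, CreatedBy) VALUES ('Acme', 'load_process')")

# Every update stamps who/when and increments UpdateCount.
cur.execute("""UPDATE Client
               SET ClientName  = 'Acme Corp',
                   UpdatedDate = datetime('now'),
                   UpdatedBy   = 'sberger',
                   UpdateCount = UpdateCount + 1
               WHERE Client_ID = 1""")

row = cur.execute("SELECT ClientName, UpdateCount, IsActive FROM Client").fetchone()
print(row)  # ('Acme Corp', 1, 1)
```

The UpdateCount column's role in lock avoidance is covered in the Locking section.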
Practical Example
The following example is designed to help illustrate normalization.
For this example I created a database schema for how I would build
the Microsoft Explorer application from scratch.
I believe everyone here has used Microsoft Explorer and can fully
visualize this exercise and see some of the power of the Normalized
architectural design.
This small database meets the requirements of the Boyce-Codd
Normal Form.
NOTES: This is not the actual Microsoft Explorer data model. This is
how I would design it. Also, I added silver keys to denote the
Candidate Key to each table. Most tables have only one Candidate
Key. For those with multiple I only used one to simplify the model
for easier conceptual understanding. As such all silver keys on a table
are used to create a single Candidate Key for each table. Lastly, audit
columns do not apply to this simple application.
Explorer Data Model
Key Database Structure
Second Example
“Persons”
This example is far more interesting. I have taken a
very common (maybe the single most common)
database architecture and restructured it using
relational algebra properly. It should be noted I have
never seen this structure used anywhere in the world,
much to my surprise. I do use it in CLEAR in
multiple locations (i.e., not just with the Persons
information). It is the correct usage of the
mathematics and has massive benefits when applied
to very large data sets, which will be demonstrated
numerically.
Data and Architecture Notes
1) Loaded just over 265 million records into both structures to give
proper time and size comparisons in a big data environment.
2) As there are duplicate records with respect to first, middle and last
names along with birthdays, no candidate key is possible in the
traditional structure.
3) I used the 2010 US Census data for a list of last names and their
frequency. They only included non-concatenated last names; as such, I
had to create my own concatenated examples.
4) I used a list of 2016 baby names in Scotland as it was the largest first
name database that I could locate with a breakout between male and
female names.
5) I randomly generated the first and middle names from this list of
known names. I also randomly generated the order to enter the names
into both table schemas to prevent bias.
6) I did not add the audit columns or display the field specifications for
conceptual simplicity.
Typical Persons Table
Notes:
1) No way to properly identify a candidate key in spite of important
defining data.
2) MiddleName and Suffix will have to allow NULL or absent values.
3) Multiple middle names almost never supported.
Normalized Persons Table Structure
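The normalized Persons structure is shown as a diagram in the original deck. As a rough, self-contained sketch of the idea (my assumed layout in sqlite3, not the actual CLEAR schema): each distinct name string is stored exactly once in a lookup table, person rows carry only small integer references, and extra middle names live in a child table so no NULLs are needed and any number is supported:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Each distinct name string is stored exactly once.
cur.execute("""CREATE TABLE NamePart (
    NamePart_ID INTEGER PRIMARY KEY,
    NameText    TEXT NOT NULL UNIQUE)""")

# Person rows hold only small integer references plus the birth date.
cur.execute("""CREATE TABLE Person (
    Person_ID    INTEGER PRIMARY KEY,
    FirstName_ID INTEGER NOT NULL REFERENCES NamePart,
    LastName_ID  INTEGER NOT NULL REFERENCES NamePart,
    BirthDate    TEXT NOT NULL)""")

# Middle names in a child table: a person with none has no rows here
# (no NULLs), and multiple middle names are naturally supported.
cur.execute("""CREATE TABLE PersonMiddleName (
    Person_ID   INTEGER NOT NULL REFERENCES Person,
    Ordinal     INTEGER NOT NULL,
    NamePart_ID INTEGER NOT NULL REFERENCES NamePart,
    PRIMARY KEY (Person_ID, Ordinal))""")

def name_id(text):
    """Insert the name once, then reuse its ID everywhere."""
    cur.execute("INSERT OR IGNORE INTO NamePart (NameText) VALUES (?)", (text,))
    return cur.execute("SELECT NamePart_ID FROM NamePart WHERE NameText = ?",
                       (text,)).fetchone()[0]

# "James James" stores the string once and references it twice.
cur.execute("INSERT INTO Person (FirstName_ID, LastName_ID, BirthDate) VALUES (?, ?, ?)",
            (name_id("James"), name_id("James"), "1990-05-05"))

distinct_names = cur.execute("SELECT COUNT(*) FROM NamePart").fetchone()[0]
print(distinct_names)  # 1
```

The deduplication of name strings is what drives the size and indexing differences in the statistics that follow.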
1st Key Database Structure
2nd Key Database Structure
Statistics – Traditional vs. Normalized
Traditional data structure
Total data size that requires indexing for increased performance:
19.9 gigabytes – also requires multiple columns to be indexed
and most likely multiple indexes
Time to count number of people born on May 5 of any year:
3 minutes 57.4 seconds
Total record count:
729,836
Time to return the ID for 1 person given first, last and middle names
plus birthdate (without indexing):
50.4 seconds
Statistics – Traditional vs. Normalized
(Continued)
Normalized data structures
Total data size that would require indexes for increased performance:
5.2 megabytes – only requires a single column to be indexed
Time to count number of people born on May 5:
1.5 seconds
Total records:
729,836
Time to return the ID for 1 person given first, last and middle names,
and birthdate (without indexing):
1.2 seconds
Statistics – Traditional vs. Normalized
(Continued)
Difference
Data requiring indexing: 3,943.3 times the data!
Time difference to retrieve information by date (not a column
that can be aided by an index – so it is what it is): 153.9 times
faster!
Without indexing either table structure – time to retrieve a
single record by customer specific data: 43.1 times faster!
Conclusion: Normalization is always faster and massively more
efficient with respect to data maintenance within a production
transaction environment.
Overloaded Data Column
The First Normal Form requires each column to be a domain. A domain is a
column that contains data from the “pool of legal values”. Legal values for
a ZipCode field are all known zip codes, not, for example, a street name.
Columns that contain more than one informational piece are referred to as
“overloaded”. It can be accurately argued that the datetime data type is the
most commonly used within databases, and it is an overloaded column.
In Microsoft SQL Server the function
“DATEPART” allows the following retrievals:
year, quarter, month, dayofyear, day, week, weekday,
hour, minute, second, millisecond, microsecond, nanosecond,
TZoffset, ISO_WEEK
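The same decomposition can be sketched in plain Python: one stored datetime value answers many distinct questions, which is exactly what makes the column "overloaded" (the quarter formula mirrors DATEPART's behavior; TZoffset is omitted since a naive datetime carries no zone):

```python
from datetime import datetime

# A single datetime value yields many date parts, DATEPART-style.
ts = datetime(2017, 5, 5, 14, 30, 45)

parts = {
    "year": ts.year,
    "quarter": (ts.month - 1) // 3 + 1,
    "month": ts.month,
    "dayofyear": ts.timetuple().tm_yday,
    "day": ts.day,
    "week": ts.isocalendar()[1],   # ISO_WEEK
    "weekday": ts.isoweekday(),
    "hour": ts.hour,
    "minute": ts.minute,
    "second": ts.second,
}
print(parts["quarter"], parts["dayofyear"])  # 2 125
```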
Multiple Candidate Keys
Boyce-Codd Normal Form Slight Flaw
A table must be in third normal form (3NF). Additionally, every domain
that determines the value of another domain must be, in part or in whole, a key
(Candidate Key) that has no overlapping domains with another key.
This rule is 99% true. The exceptions
are primarily with reference data, where the
difficulties of maintaining the data are moot at
best. The DateInfo table to the right is an
excellent example, as the table data will
never change. I typically add records to
support multiple centuries (36,525 days per
century – a tiny amount of data for a table to
support).
Database Table Locking
The Power of Logical Level Locking
1) In most RDBMS the default table locking for select
statements is a shared lock.
2) Shared locks easily escalate in high volume production
environments leading to poor performance and deadlocks.
3) In the vast majority of cases the lock proves unnecessary.
4) In almost all cases the locks can be completely avoided
without creating concurrency issues by using an
UpdateCount field, NOLOCK (or equivalent table locking
hint) when selecting data and logic checks while
conducting updates, inserts and deletes from the previously
selected data.
5) In my experience the difference in high volume production
environments is in almost all cases massive.
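Point 4, optimistic concurrency via an UpdateCount column, can be sketched as follows (sqlite3 for illustration; in SQL Server the reads would additionally carry NOLOCK hints, which sqlite has no equivalent for):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""CREATE TABLE Account (
    Account_ID  INTEGER PRIMARY KEY,
    Balance     INTEGER NOT NULL,
    UpdateCount INTEGER NOT NULL DEFAULT 0)""")
cur.execute("INSERT INTO Account VALUES (1, 100, 0)")

# Read without holding a lock; remember the UpdateCount we saw.
balance, seen_count = cur.execute(
    "SELECT Balance, UpdateCount FROM Account WHERE Account_ID = 1").fetchone()

# Write only if nobody changed the row since we read it.
cur.execute("""UPDATE Account
               SET Balance = ?, UpdateCount = UpdateCount + 1
               WHERE Account_ID = 1 AND UpdateCount = ?""",
            (balance - 30, seen_count))
first_write_ok = cur.rowcount == 1

# A second writer holding the now-stale count is rejected (0 rows
# affected); its caller re-reads and retries instead of deadlocking.
cur.execute("""UPDATE Account
               SET Balance = ?, UpdateCount = UpdateCount + 1
               WHERE Account_ID = 1 AND UpdateCount = ?""",
            (balance - 50, seen_count))
stale_write_ok = cur.rowcount == 1

print(first_write_ok, stale_write_ok)  # True False
```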
Normalization Conclusion
1) Proper normalization of the data model can save companies working on
big data or high production transaction databases tens of millions in
hardware and maintenance expenses over the life of a company.
2) Performance in a production environment will always be more reliable
and significantly faster using relational algebra.
3) No one can legitimately claim to be a professional database architect
without being proficient in relational algebra.
4) Even with data warehousing some data should always be normalized for
maximum performance and flexibility. A good example is the date
information, which can be used to rapidly slice and dice denormalized
data marts efficiently for maximum flexibility with the data.
5) The database can make or break almost all projects. Proper database
design, locking schema and efficient database code are always essential.
Securing the Data Layer
Once past the network security layer, which is often far more geared to
protecting against outside intrusion, hackers often experience little to no real
impediments to gaining access and control of the database servers.
Protection should be at all layers with equal and extreme diligence.
Aside from the common data protection deterrents, I have listed how to
properly add security that, I believe, will give even the NSA ulcers if they
should try to hack.
Please note, just as there is a large gap between technology available and
technology applied, there is also a large gap between known best practices
and those practices actually applied. In most environments, big and small,
developers, IT personnel and even sometimes executives want, and usually
get, a back door entry into the production databases.
Securing the Data Layer (continued)
Recommendations:
1) Change the default port setting to an obscure port. Strangely, in my entire career every
company I have joined has had all of its servers running on the standard ports.
Completely unnecessary. All ports other than the random one chosen for the DB
server should be closed, and the DB server port restricted to the DB application.
2) Deny data reader and data writer to all logins. Do not allow any login to the DB servers
to have access to anything but executing stored procedures. No ad hoc querying or
dynamic SQL allowed! Direct access to the data circumvents all business rules and
allows direct access from the users to your data. Very bad practice and poor security.
3) Use a multi-Unicode 30+ character password for the database server system
administrator account.
4) Deny all access by local administrators to the database layer.
5) Use a multi-Unicode 30+ character password for the NT Administrator account.
6) Disable all local administrators.
7) Use at least 3 multi-Unicode characters in all of the stored procedure names (i.e.,
characters from other languages). According to Wikipedia Unicode currently contains
136,755 distinct characters. Many are not allowed within SQL, but still the difference
in combinations just in a 5 character name is staggering!
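The order-of-magnitude claim in point 7 can be checked with quick arithmetic (figures illustrative: the exact count of SQL-legal identifier characters varies by server, so a conservative pool is assumed):

```python
# Rough combinatorics behind the Unicode-name recommendation.
ascii_letters = 52        # a-z, A-Z only
unicode_pool = 100_000    # conservative slice of Unicode's ~137k characters

ascii_names = ascii_letters ** 5    # possible 5-character ASCII names
unicode_names = unicode_pool ** 5   # possible 5-character Unicode names

print(f"{ascii_names:.3e}")    # 3.802e+08
print(f"{unicode_names:.3e}")  # 1.000e+25
```

Even with a reduced pool, the guessing space grows by roughly seventeen orders of magnitude.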
Securing the Data Layer (continued)
8) Create an application for administrators that opens a port and enables their OS
administrator logins for a limited amount of time. That process needs to keep an
audit trail that includes the IP address and machine MAC address. MAC addresses
should be pre-authorized.
9) Alternate between logins (at least three) every minute changing the password to each
every minute to a new, but calculable, password. Retain each password for three
minutes to allow overlapping. Again, passwords should be long and multi-Unicode.
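One way point 9's "calculable" rotating password could work is a TOTP-style derivation, sketched below. This scheme, the login name, and the secret are all my assumptions for illustration, not a prescription from the deck; both sides derive the password from a shared secret and the current minute, so it rotates without ever being transmitted:

```python
import hashlib
import hmac

# Hypothetical shared secret; a real deployment would load this from a vault.
SECRET = b"example-shared-secret"

def password_for(login: str, epoch_minute: int) -> str:
    """Derive the minute's password from the login name and the clock."""
    msg = f"{login}:{epoch_minute}".encode()
    digest = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
    # 40 hex characters; production would map into a wider character set
    # to satisfy the multi-Unicode 30+ character guidance.
    return digest[:40]

# The password changes every minute, yet is reproducible, which is what
# allows the 3-minute overlap window described above.
p1 = password_for("svc_db_admin", 24_000_000)
p2 = password_for("svc_db_admin", 24_000_001)
print(p1 != p2, p1 == password_for("svc_db_admin", 24_000_000))  # True True
```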
10) Production data is never to be shared with employees no matter what title they have
or how much they complain. If they need information then a report can be developed
for them that properly follows Sarbanes-Oxley (SOX) requirements and is well vetted
and approved.
If these standards are followed, the production data will be secure. Programmers and IT
personnel may complain, but they are not being paid large salaries to do easy work. Their
work is first and foremost to protect the key company data and the integrity and privacy of
client data, and to make sure the company’s products and services are highly available and
dependable.
Problems with Outsourcing IT
The company accountants always like to look for ways to reduce expenses.
The problem is at what cost. Some costs cannot be measured in mere P & L.
One such cost is the proliferation of key company information and
intellectual properties. The following are some of the dangers that occur
when outsourcing:
1) Access to servers is granted with administration rights to persons
selected by the outsourcing company. No vetting, not even a list of those
given access and their backgrounds, credentials, criminal history –
anything at all…
2) Much of the company data is often accessible. That includes backups,
company personnel information, contracts, bids, clients, data, inventions,
etc.
Problems with Outsourcing IT
(Continued)
3) Outsourced IT companies often outsource themselves to companies
abroad. Your data is then accessible to persons unknown in India,
Pakistan and China to name a few. Those persons are completely
unknown to you and beyond any legal restrictions of the United States.
4) Much has been stated by the government and the news about China and
other countries hacking and stealing our key data. I propose they are not
stealing it as much as we are giving it to them and, to add insult to injury,
we are actually paying them to take it.
Some expenses simply make sense and are a cost of doing business in this
day and age. IT is one of those key mission critical expenses. Live with it…
Never lose control of your company’s life’s blood.
Data Theft – Primary Weaknesses
1) Approximately half of all data theft incidents are from employees.
2) That does not count those that you have given access via outsourcing
your IT.
3) Most large data breaches occur from employees stealing entire backups
and having access to large file data stores of documents and company
critical information.
4) Almost all of the information Julian Assange publishes via
WikiLeaks comes from employee data theft, primarily stolen backup
tapes.
5) In 98% of companies less than half of the data tape backup files are
encrypted according to surveys of IT professionals.
Data Theft – Primary Weaknesses
(Continued)
6) Symptoms of data theft by a source with internally granted company access:
a) Size of the amount of data: hackers want to be quick and are naturally
worried about being caught, so they filter their data searches to find the
critical documents or data rapidly. Those working within have no such
time constraints and tend to be far less skilled, so searching is left to those
receiving the data from the thief.
b) Breadth of the data theft: hackers focus on who they want files from,
again limited by time. Those working within tend to take all users’ data.
c) Scope of the data: Once in, the hacker will look around as quickly as
possible and attempt to gain information from multiple points within the
network. Employees tend to take all of one type of data that they are
focused on, usually pertaining to what they are specifically working on,
have been specifically granted permission to access, and have an issue with.
Data Theft – Primary Weaknesses
(Continued)
Profiling data theft can assist when protecting your data. Know your
data and the type of interests that will want to take, change or distribute your
key company information.
A high-profile theft that took place last year at the DNC fits the
symptoms of an internal data theft more than an external one. The fact that the FBI
was refused access and simply, and unprofessionally, accepted the DNC’s
word for who hacked them is not surprising, as the FBI is far behind with
respect to investigating and prosecuting data theft. Do not expect the
government to protect your data anytime in the foreseeable future, or, for
that matter, to properly investigate or prosecute those guilty of its theft. It is
on you!
Changing the Way the Financial World
Processes & Utilizes Information
Thank You
Samuel Berger
Chief Information Officer
(805) 701-0761
sberger@ClearFinTech.com
George Sterling Harris
Executive Vice President
(310) 295-7524
gharris@ClearFinTech.com

Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024SynarionITSolutions
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesBoston Institute of Analytics
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsRoshan Dwivedi
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 

Recently uploaded (20)

GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 

Data Architecture (i.e., normalization / relational algebra) and Database Security

  • 1. Changing the Way the Financial World Processes & Utilizes Information Copyright © 2017 CLEAR Information, Inc., a United States Class C corporation, all rights reserved.
  • 2. Introduction Speaker: Samuel Berger Topic: Data architecture (i.e., normalization / relational algebra) and database security Description: This presentation underscores the importance of creating precise data structures when handling, processing and manipulating massive amounts of data. As data has become key to the operations of virtually all major companies around the world, keeping that data easy to maintain and use is pivotal. In today’s hyper-competitive business climate, companies often live or die by their ability to manipulate their data to advantage. It is therefore paramount that this enterprise-critical data be housed in well-organized structures that are intuitive for developers to work on. The bulk of this presentation offers tips and examples on how to do so, along with the numerical benefits, using a large-data example.
  • 3. Background I am a Fintech entrepreneur and developer as well as a data scientist, and have worked mostly on financial mass-data projects since 1989. I started in these fields with SBIC, using technology and massive amounts of data to predict the world’s largest financial market, FOREX. My systems earned my clients (Daiwa Securities, Bank of Montreal, Julius Bär Group Ltd., Société Générale, royalty and national treasuries, to name a few) returns of over 18% per annum non-compounded over the 5½ years we traded. At peak my company traded the equivalent of over $1 billion in a day. Some of my other projects included working on E*Trade’s E*Advisor system, VeriSign, SGI, two industry-founding VOIP unified messaging companies, and serving as Enterprise Architect for Capital Group Companies (which managed over $1.3 trillion at the time). I am currently working on a large project for CLEAR.
  • 4. Discussion Points I. Relational Algebra / Normalization Familiarity II. Brief History III. 1st Normal Form Simply Stated IV. Best Practices V. Practical Examples VI. Key Structures VII. Performance and Data Space / Maintenance VIII. Overloaded Domains (Columns) IX. Theory Modified Slightly by Practice X. Locking XI. Normalization Conclusion XII. Securing the Data Layer XIII. Problems with Outsourcing IT XIV. Data Theft – Primary Weaknesses XV. Q & A
  • 5. Relational Algebra Database Normalization Have you worked on a database that was in at least the First Normal Form (1NF)? Does anyone know at what point in Normalization duplicates are no longer allowed? Does anyone know at what point NULLs are no longer allowed?
  • 6. Brief History Relational Algebra was primarily developed by Edgar F. Codd from 1969 to 1973, and primarily documented by Chris Date, both IBM employees. Codd also created his “12 rules” (really 13, as he started from zero) that were used to define the qualifications of a relational database management system (RDBMS). Codd’s work heavily influenced IBM’s first RDBMS, System R, back in 1973. System R’s query language was created by Ray Boyce and Don Chamberlin: the Structured Query Language (SQL), originally called SEQUEL while in development, hence the reason we still pronounce SQL Server as “Sequel” Server. Codd and Boyce later teamed up to create the Boyce-Codd Normal Form, which is one step stricter than the 3rd Normal Form.
  • 7. 1st Normal Form (1NF) My goal here is not to confuse but to simplify. Key principles: 1) Each row must have at least one unique key, also referred to as a Candidate Key (i.e., no duplicate rows). A Candidate Key is the minimum column grouping on a table that creates a unique record. Columns that do not help to define uniqueness are attributes of the Candidate Key (or Candidate Keys, as the case may be, should more than one unique column set exist). 2) Every row-column intersection must have a value. 3) Every row-column intersection can only contain one value, not a list of values. 4) Every row-column intersection must have a valid value from the pool of potential valid values (i.e., a plane parts table cannot have a column for engine parts and then enter into it both engine parts and plane max speeds). 5) The functionality of the table is not dependent on the order of the data with respect to rows or columns (i.e., querying the data determines the column order and row order of the output).
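The principles above can be sketched with a small, hypothetical example (table and column names are mine, not from the slides), using SQLite through Python to keep it runnable:

```python
import sqlite3

# Hypothetical example: a table that violates 1NF by packing a list of
# values into one column, and its 1NF-compliant replacement.
con = sqlite3.connect(":memory:")
cur = con.cursor()

# Violates principle 3: one row-column intersection holds a list of values.
cur.execute("CREATE TABLE PlaneBad (PlaneName TEXT, EnginePartNumbers TEXT)")
cur.execute("INSERT INTO PlaneBad VALUES ('B737', 'EP-1,EP-2,EP-3')")

# 1NF fix: one value per intersection; (PlaneName, EnginePartNumber) is the
# Candidate Key, so duplicate rows are impossible.
cur.execute("""CREATE TABLE PlaneEnginePart (
    PlaneName        TEXT NOT NULL,
    EnginePartNumber TEXT NOT NULL,
    PRIMARY KEY (PlaneName, EnginePartNumber))""")
cur.executemany("INSERT INTO PlaneEnginePart VALUES (?, ?)",
                [("B737", p) for p in ("EP-1", "EP-2", "EP-3")])

# Set-based queries now work without string parsing.
count = cur.execute("SELECT COUNT(*) FROM PlaneEnginePart "
                    "WHERE PlaneName = 'B737'").fetchone()[0]
print(count)  # 3
```

Counting, joining, or deleting a single part now uses ordinary relational operations instead of string manipulation.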
  • 8. Best Practices Modern database terminology uses the term Primary Key as a binder of the data more than as a concept of a unique row identifier based on data properties. As such, the Primary Key is now a separate concept from the Primary Candidate Key. It should always be an auto-growing integer starting sequentially from row 1, and the server prefers that it be the first column. My naming convention is the table name plus “_ID”. Table and column names should always be descriptive, even if verbose. Mistakes occur most commonly due to lack of understanding of the data model and the purposes of each container. Never use database keyword names for column names (i.e., name is a keyword, as is filename). Data should be related for the data’s sake and not for the current application requirements. Requirements change; if the data is structured accurately, then the data model will remain accurate.
  • 9. Audit Columns Another component of best practices is to include audit columns covering the following key pieces of data: 1) The date and time the record was created. 2) The person ID or process name that created the record. 3) The date and time the record was last updated. 4) The person ID or process name that made the update. 5) Update count – a truly critical column for every table that will be covered a bit later in this discussion. 6) A column indicating whether the record is currently active. Note: In blockchain projects records are never updated, only added. Also, additional information must be included to indicate whether or not the data is in sync with its partners.
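The six audit columns might look like this on a hypothetical Orders table (column names are illustrative assumptions, not the slides' schema), again sketched in SQLite:

```python
import sqlite3

# Sketch of the six audit columns applied to a hypothetical Orders table.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE Orders (
    Orders_ID     INTEGER PRIMARY KEY AUTOINCREMENT,     -- auto-growing surrogate key
    CustomerName  TEXT NOT NULL,
    CreatedDate   TEXT NOT NULL DEFAULT (datetime('now')),  -- 1) when created
    CreatedBy     TEXT NOT NULL,                             -- 2) who/what created it
    UpdatedDate   TEXT,                                      -- 3) when last updated
    UpdatedBy     TEXT,                                      -- 4) who/what updated it
    UpdateCount   INTEGER NOT NULL DEFAULT 0,                -- 5) optimistic-lock counter
    IsActive      INTEGER NOT NULL DEFAULT 1)                -- 6) active flag
""")
con.execute("INSERT INTO Orders (CustomerName, CreatedBy) "
            "VALUES ('Acme', 'load_process')")
row = con.execute("SELECT UpdateCount, IsActive FROM Orders").fetchone()
print(row)  # (0, 1)
```

Defaults let the database fill in the creation audit data automatically; update triggers or application code maintain columns 3-5.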
  • 10. Practical Example The following example is designed to help illustrate normalization. For this example I created a database schema for how I would build the Microsoft Explorer application from scratch. I believe everyone here has used Microsoft Explorer and can fully visualize this exercise and see some of the power of the Normalized architectural design. This small database meets the requirements of the Boyce-Codd Normal Form. NOTES: This is not the actual Microsoft Explorer data model. This is how I would design it. Also, I added silver keys to denote the Candidate Key to each table. Most tables have only one Candidate Key. For those with multiple I only used one to simplify the model for easier conceptual understanding. As such all silver keys on a table are used to create a single Candidate Key for each table. Lastly, audit columns do not apply to this simple application.
  • 11. Explorer Data Model
  • 12. Key Database Structure
  • 13. Second Example “Persons” This example is far more interesting. I have taken a very common (maybe the single most common) database architecture and properly restructured it using relational algebra. It should be noted that I have never seen this structure used anywhere in the world, much to my surprise. I do use it in CLEAR in multiple locations (i.e., not just with the Persons information). It is the correct usage of the mathematics and has massive benefits when applied to very large data sets, which will be covered numerically.
  • 14. Data and Architecture Notes 1) Loaded just over 265 million records into both structures to give proper time and size comparisons in a big data environment. 2) As there are duplicate records with respect to first, middle and last names along with birthdays, no candidate key is possible in the traditional structure. 3) I used the 2010 US Census data for a list of last names and their frequency. They only included non-concatenated last names, as such I had to create my own concatenated examples. 4) I used a list of 2016 baby names in Scotland as it was the largest first name database that I could locate with a breakout between male and female names. 5) I randomly generated the first and middle names from this list of known names. I also randomly generated the order to enter the names into both table schemas to prevent bias. 6) I did not add the audit columns or display the field specifications for conceptual simplicity.
  • 15. Typical Persons Table Notes: 1) No way to properly identify a candidate key in spite of important defining data. 2) MiddleName and Suffix will have to allow NULL or absent values. 3) Multiple middle names are almost never supported.
  • 16. Normalized Persons Table Structure
  • 17. 1st Key Database Structure
  • 18. 2nd Key Database Structure
  • 19. Statistics – Traditional vs. Normalized Traditional data structure Total data size that requires indexing for increased performance: 19.9 gigabytes – also requires multiple columns to be indexed and most likely multiple indexes Time to count number of people born on May 5 of any year: 3 minutes 57.4 seconds Total record count: 729,836 Time to return the ID for 1 person given first, last and middle names plus birthdate (without indexing): 50.4 seconds
  • 20. Statistics – Traditional vs. Normalized (Continued) Normalized data structures Total data size that would require indexes for increased performance: 5.2 megabytes – only requires a single column to be indexed Time to count number of people born on May 5: 1.5 seconds Total records: 729,836 Time to return the ID for 1 person given first, last and middle names, and birthdate (without indexing): 1.2 seconds
  • 21. Statistics – Traditional vs. Normalized (Continued) Difference Data requiring indexing: 3,943.3 times the data! Time difference to retrieve information by date (not a column that can be aided by an index – so it is what it is): 153.9 times faster! Without indexing either table structure – time to retrieve a single record by customer specific data: 43.1 times faster! Conclusion: Normalization is always faster and massively more efficient with respect to data maintenance within a production transaction environment.
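The core idea behind these numbers can be sketched at miniature scale: store each distinct name string once in a lookup table and keep only small integer foreign keys on the Persons table. Table and column names below are my guesses at the structure the slides picture, not the actual CLEAR schema:

```python
import sqlite3

# Miniature sketch of the normalized Persons structure: distinct name
# strings live once in a lookup table; Persons holds integer keys + dates.
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE Names (Names_ID INTEGER PRIMARY KEY AUTOINCREMENT, "
            "Name TEXT UNIQUE NOT NULL)")
cur.execute("""CREATE TABLE Persons (
    Persons_ID    INTEGER PRIMARY KEY AUTOINCREMENT,
    FirstName_ID  INTEGER NOT NULL REFERENCES Names,
    LastName_ID   INTEGER NOT NULL REFERENCES Names,
    BirthDate     TEXT NOT NULL)""")

def name_id(name):
    # Insert the name only if it is not already present, then return its key.
    cur.execute("INSERT OR IGNORE INTO Names (Name) VALUES (?)", (name,))
    return cur.execute("SELECT Names_ID FROM Names WHERE Name = ?",
                       (name,)).fetchone()[0]

for first, last, born in [("John", "Smith", "1970-05-05"),
                          ("John", "Jones", "1980-05-05"),
                          ("Mary", "Smith", "1990-01-02")]:
    cur.execute("INSERT INTO Persons (FirstName_ID, LastName_ID, BirthDate) "
                "VALUES (?, ?, ?)", (name_id(first), name_id(last), born))

# Three persons, but only four distinct name strings stored.
names = cur.execute("SELECT COUNT(*) FROM Names").fetchone()[0]
may5 = cur.execute("SELECT COUNT(*) FROM Persons "
                   "WHERE substr(BirthDate, 6) = '05-05'").fetchone()[0]
print(names, may5)  # 4 2
```

At 265 million rows the duplication removed by this design is what shrinks the indexable data from gigabytes to megabytes.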
  • 22. Overloaded Data Column The First Normal Form requires each column to be a domain. A domain is a column that contains data from the “pool of legal values”. Legal values for a ZipCode field are all known zip codes, not, for example, a street name. Columns that contain more than one informational piece are referred to as “overloaded”. It can be accurately argued that the datetime data type is the most commonly used within databases and is an overloaded column. In Microsoft SQL Server the function “DATEPART” allows the following retrievals: year, quarter, month, dayofyear, day, week, weekday, hour, minute, second, millisecond, microsecond, nanosecond, TZoffset, ISO_WEEK.
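To see how much information one datetime value packs, here is a rough cross-language analogy (not an exact mapping of DATEPART) unpacking several of the same parts with Python's standard datetime:

```python
from datetime import datetime

# One "overloaded" datetime value unpacked into its informational pieces,
# mirroring a subset of what SQL Server's DATEPART can retrieve.
ts = datetime(2017, 5, 5, 14, 30, 15)
parts = {
    "year": ts.year,
    "quarter": (ts.month - 1) // 3 + 1,   # derived; no built-in quarter
    "month": ts.month,
    "dayofyear": ts.timetuple().tm_yday,
    "day": ts.day,
    "weekday": ts.isoweekday(),
    "hour": ts.hour,
    "minute": ts.minute,
    "second": ts.second,
}
print(parts["quarter"], parts["dayofyear"])  # 2 125
```

Every one of these derivable pieces is a separate "fact" hiding inside a single column, which is exactly what makes datetime the canonical overloaded domain.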
  • 23. Multiple Candidate Keys Boyce-Codd Normal Form: A table must be in third normal form (3NF). Additionally, every domain that determines the value of another domain must be, in part or in whole, a key (Candidate Key) that has no overlapping domains with another key. Slight Flaw: This rule is 99% true. The exceptions are primarily with reference data, as the difficulties of maintaining such data are moot at best. The DateInfo table to the right is an excellent example, as the table data will never change. I typically add records to support multiple centuries (36,525 days per century – a tiny amount of data for a table to support).
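Populating a DateInfo reference table like the one pictured is cheap to script; the column choices below (date, year, quarter, month, day, weekday, day-of-year) are my assumptions about what such a table would carry:

```python
from datetime import date, timedelta

# Sketch of generating DateInfo reference rows: one row per day with
# pre-computed date parts, so queries can slice by any part via a join.
def date_info_rows(start=date(2000, 1, 1), days=36525):  # ~one century
    for n in range(days):
        d = start + timedelta(days=n)
        yield (d.isoformat(), d.year, (d.month - 1) // 3 + 1, d.month,
               d.day, d.isoweekday(), d.timetuple().tm_yday)

rows = list(date_info_rows(days=366))  # year 2000 is a leap year
print(len(rows), rows[-1][0])  # 366 2000-12-31
```

Because these rows never change once generated, the usual maintenance objections to extra determining domains do not apply, which is the "slight flaw" in the rule noted above.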
  • 24. Database Table Locking The Power of Logical Level Locking 1) In most RDBMS the default table locking for select statements is a shared lock. 2) Shared locks easily escalate in high volume production environments leading to poor performance and deadlocks. 3) In the vast majority of cases the lock proves unnecessary. 4) In almost all cases the locks can be completely avoided without creating concurrency issues by using an UpdateCount field, NOLOCK (or equivalent table locking hint) when selecting data and logic checks while conducting updates, inserts and deletes from the previously selected data. 5) In my experience the difference in high volume production environments is in almost all cases massive.
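The UpdateCount technique in point 4 is a form of optimistic concurrency control; a minimal sketch (hypothetical Account table, SQLite stand-in for the production RDBMS):

```python
import sqlite3

# Optimistic-concurrency sketch: read without locks, then make the UPDATE
# conditional on the UpdateCount that was read. A concurrent writer causes
# zero rows to match (signalling a retry) instead of a blocked lock.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Account (Account_ID INTEGER PRIMARY KEY, "
            "Balance REAL, UpdateCount INTEGER)")
con.execute("INSERT INTO Account VALUES (1, 100.0, 0)")

def deposit(con, account_id, amount):
    bal, count = con.execute(
        "SELECT Balance, UpdateCount FROM Account WHERE Account_ID = ?",
        (account_id,)).fetchone()
    cur = con.execute(
        "UPDATE Account SET Balance = ?, UpdateCount = UpdateCount + 1 "
        "WHERE Account_ID = ? AND UpdateCount = ?",
        (bal + amount, account_id, count))
    return cur.rowcount == 1  # False means another writer got there first

ok = deposit(con, 1, 50.0)
print(ok, con.execute("SELECT Balance, UpdateCount FROM Account").fetchone())
# True (150.0, 1)
```

The logic check lives in the WHERE clause, so correctness is preserved even though the initial SELECT took no lock at all (or used NOLOCK in SQL Server terms).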
  • 25. Normalization Conclusion 1) Proper normalization of the data model can save companies working on big data or high production transaction databases tens of millions in hardware and maintenance expenses over the life of a company. 2) Performance in a production environment will always be more reliable and significantly faster using relational algebra. 3) No one claiming to be a professional database architect can make that claim without being proficient in relational algebra. 4) Even with data warehousing some data should always be normalized for maximum performance and flexibility. A good example is the date information, which can be used to rapidly slice and dice denormalized data marts efficiently for maximum flexibility with the data. 5) The database can make or break almost all projects. Proper database design, locking schema and efficient database code is always essential.
  • 26. Securing the Data Layer Once past the network security layer, which is often far more geared to protecting against outside intrusion, hackers often experience little to no real impediments to gaining access and control of the database servers. Protection should be at all layers with equal and extreme diligence. Aside from the common data protection deterrents, I have listed how to properly add security that, I believe, will give even the NSA ulcers if they should try to hack. Please note, just as there is a large gap between technology available and technology applied, there is also a large gap between known best practices and those practices actually applied. In most environments, big and small, developers, IT personnel and even sometimes executives want, and usually get, a back door entry into the production databases.
  • 27. Securing the Data Layer (continued) Recommendations: 1) Change the default port setting to an obscure port. Strangely, in my entire career I have never come to a company and found any of their servers running on anything but the standard ports. Completely unnecessary. All ports outside of the random one chosen for the DB server should be closed, and the DB server port restricted to the DB application. 2) Deny data reader and data writer to all logins. Do not allow any login to the DB servers to have access to anything but executing stored procedures. No ad hoc querying or dynamic SQL allowed! Direct access to the data circumvents all business rules and allows direct access from the users to your data. Very bad practice and poor security. 3) Use a multi-Unicode 30+ character password for the database server system administrator account. 4) Deny all access by local administrators to the database layer. 5) Use a multi-Unicode 30+ character password for the NT Administrator account. 6) Disable all local administrators. 7) Use at least 3 multi-Unicode characters in all of the stored procedure names (i.e., characters from other languages). According to Wikipedia, Unicode currently contains 136,755 distinct characters. Many are not allowed within SQL, but still the difference in combinations in just a 5 character name is staggering!
  • 28. Securing the Data Layer (continued) 8) Create an application for administrators that will open a port and enable their OS administrator logins for a limited amount of time. That process needs to keep an audit trail that includes the IP address and the machine’s MAC address. MAC addresses should be pre-authorized. 9) Alternate between logins (at least three) every minute, changing the password of each every minute to a new, but calculable, password. Retain each password for three minutes to allow overlapping. Again, passwords should be long and multi-Unicode. 10) Production data is never to be shared with employees, no matter what title they have or how much they complain. If they need information then a report can be developed for them that properly follows Sarbanes-Oxley (SOX) requirements and is well vetted and approved. If these standards are followed the production data will be secured. Programmers and IT personnel may complain, but they are not being paid large salaries to do easy work. Their work is first and foremost to protect the key company data and the integrity and privacy of client data, and to make sure the company’s products and services are highly available and dependable.
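One way the "calculable" rotating passwords of recommendation 9 could be derived is from a shared secret plus the current minute; this is purely my illustrative sketch, as the slides do not specify a derivation, and the secret, login name, and key-derivation choice here are all assumptions:

```python
import hashlib
import hmac
import time

# Hedged sketch of a minute-windowed, calculable password: both the rotator
# and the application derive the same password from a shared secret and the
# current epoch minute, so it changes every minute without being stored.
SHARED_SECRET = b"example-secret-not-for-production"

def window_password(secret, login_name, epoch_minute):
    msg = f"{login_name}:{epoch_minute}".encode()
    return hmac.new(secret, msg, hashlib.sha256).hexdigest()[:32]

def valid_passwords(secret, login_name, now=None):
    # Accept previous, current and next windows, giving the three-minute
    # overlap the slide recommends.
    minute = int((now if now is not None else time.time()) // 60)
    return [window_password(secret, login_name, m)
            for m in (minute - 1, minute, minute + 1)]

pw = window_password(SHARED_SECRET, "svc_login_1", 29_000_000)
assert pw in valid_passwords(SHARED_SECRET, "svc_login_1",
                             now=29_000_000 * 60)
```

A real deployment would use a stronger derivation (e.g. full-length output mapped into the multi-Unicode character set) and rotate the secret itself periodically.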
  • 29. Problems with Outsourcing IT The company accountants always like to look for ways to reduce expenses. The problem is at what cost. Some costs cannot be measured in mere P & L. One such cost is the proliferation of key company information and intellectual property. The following are some of the dangers that occur when outsourcing: 1) Access to servers is granted, with administration rights, to persons selected by the outsourced company. No vetting, not even a list of those given access and their backgrounds, credentials, criminal history – anything at all… 2) Much of the company data is often accessible. That includes backups, company personnel information, contracts, bids, clients, data, inventions, etc.
  • 30. Problems with Outsourcing IT (Continued) 3) Outsourced IT companies often outsource themselves to companies abroad. Your data is then accessible to persons unknown in India, Pakistan and China, to name a few. Those persons are completely unknown to you and beyond any legal restrictions of the United States. 4) Much has been stated by the government and the news about China and other countries hacking and stealing our key data. I propose they are not stealing it as much as we are giving it to them and, to add insult to injury, we are actually paying them to take it. Some expenses simply make sense and are a cost of doing business in this day and age. IT is one of those key mission critical expenses. Live with it… Never lose control of your company’s life’s blood.
  • 31. 30 Data Theft – Primary Weaknesses 1) Approximately half of all data theft incidents are committed by employees. 2) That does not count those you have given access to by outsourcing your IT. 3) Most large data breaches occur when employees steal entire backups or have access to large file stores of documents and company-critical information. 4) Almost all of the information Julian Assange publishes via WikiLeaks comes from employee data theft, primarily stolen backup tapes. 5) According to surveys of IT professionals, in 98% of companies less than half of the tape backup files are encrypted.
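One way to act on point 5 is to audit backup media for files that do not look encrypted. The sketch below is a rough illustration, not a guarantee: it samples each file's leading bytes and flags those with low Shannon entropy, since properly encrypted data is statistically close to random (near 8 bits per byte). The function names and the 7.5 threshold are assumptions for the example.

```python
import math
from collections import Counter
from pathlib import Path

def shannon_entropy(data):
    """Bits per byte of the sample; encrypted or compressed data sits near 8.0."""
    if not data:
        return 0.0
    counts = Counter(data)
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def audit_backups(paths, threshold=7.5, sample_bytes=65536):
    """Return the backup files whose leading bytes look unencrypted
    (low entropy). Compressed-but-unencrypted files can evade this check."""
    flagged = []
    for p in paths:
        sample = Path(p).read_bytes()[:sample_bytes]
        if shannon_entropy(sample) < threshold:
            flagged.append(str(p))
    return flagged
```

Running such an audit against every tape or backup image before it leaves the building turns the survey statistic in point 5 into an enforceable policy.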
  • 32. 31 Data Theft – Primary Weaknesses (Continued) 6) Symptoms of data theft by a source with internally granted access: a) Size of the data: hackers want to be quick and worry about being caught, so they filter their searches to find the critical documents or data rapidly. Insiders have no such time constraints and tend to be far less skilled, so searching is left to whoever receives the data from the thief. b) Breadth of the theft: hackers focus on the specific users they want files from, again limited by time; insiders tend to take all users' data. c) Scope of the data: once in, hackers look around as quickly as possible and attempt to gather information from multiple points within the network. Employees tend to take all of one type of data, usually data pertaining to what they are specifically working on, have been granted permission to access, or have a grievance about.
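The three symptoms above (size, breadth, scope) can be turned into a crude triage heuristic for a suspected incident. The sketch below is illustrative only; the thresholds and names (`ExfilEvent`, `classify`) are invented for the example, not an established forensic standard, and a real investigation would weigh far more evidence.

```python
from dataclasses import dataclass

@dataclass
class ExfilEvent:
    """Observed characteristics of a suspected data-theft incident."""
    gigabytes: float     # total volume taken (size)
    users_affected: int  # how many users' data was touched (breadth)
    data_types: int      # distinct categories of data taken (scope)

def classify(event, bulk_gb=10.0, broad_users=50):
    """Apply the slide's profile: insiders take bulk, broad, single-type
    data; hackers take small, targeted, multi-type data. Thresholds are
    illustrative assumptions."""
    insider_signals = 0
    insider_signals += event.gigabytes >= bulk_gb          # unfiltered bulk copy
    insider_signals += event.users_affected >= broad_users # all users' data
    insider_signals += event.data_types == 1               # one focused data type
    return "internal" if insider_signals >= 2 else "external"
```

A whole-backup theft (hundreds of gigabytes, every user, one data type) scores as internal; a quick, targeted grab of a few users' files across several data types scores as external, matching the profile in points a) through c).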
  • 33. 32 Data Theft – Primary Weaknesses (Continued) Profiling data theft can help you protect your data. Know your data and the kinds of interests that will want to take, change or distribute your key company information. A high-profile theft that took place last year at the DNC fits the symptoms of an internal data theft more than an external one. The fact that the FBI was refused access and simply, and unprofessionally, accepted the DNC's word for who hacked them is not surprising, as the FBI is far behind with respect to investigating and prosecuting data theft. Do not expect the government to protect your data for the foreseeable future, or for that matter to properly investigate or prosecute those guilty of its theft. It is on you!
  • 34. 33 Changing the Way the Financial World Processes & Utilizes Information Thank You Samuel Berger Chief Information Officer (805) 701-0761 sberger@ClearFinTech.com George Sterling Harris Executive Vice President (310) 295-7524 gharris@ClearFinTech.com