Document Databases In Online Publishing
1. My name is Irakli. Let me give you some background about myself and how I tricked conference organizers into thinking that I was qualified to talk today. :)
I am a director of engineering at National Public Radio. Which is a fancy way of saying: I lead the software team that is responsible for the code behind npr.org, the NPR API and NPR mobile apps. Prior to joining NPR, I spent several years developing open-source products for the online publishing industry. Some of these products are now used by news organizations like The Nation, The New Republic, Thomson Reuters and Al Jazeera.
I have been using document-based [or so-called NoSQL] databases, on and off, for almost a year now and have enjoyed the experience a lot! Because I enjoyed it so much, I wanted to share my story at this conference. I contacted the organizers and they kindly agreed [I hope they will not regret it by the time we are done :)].
So here it is: one guy’s story of falling in love with document databases and why he thinks they have a significant role in online publishing, specifically.
2. One of the main reasons why I love document databases is that they are a truly disruptive technology. And when we say “disruptive technology” we mean something so innovative that it helps create a fundamentally new value network, thus altering the existing market and disrupting legacy technologies in that market.
The innovation of disruptive technologies is not just an incremental progression over existing capabilities. Rather, it is a fundamentally re-thought, novel approach to solving hard problems.
For instance, there are many good SQL databases, both open-source and commercial. And everybody has their favorite: some like SQL server X’s simplicity, others love the power of database Y, etc. But fundamentally SQL is one way to model data and solve data-warehousing problems. It has its time-proven advantages, as well as some significant shortcomings.
Document databases are an architecturally different approach to solving data problems. They are not a drop-in replacement for, or an incremental improvement over, SQL. They do have their own shortcomings, but they also allow solving problems that were either very hard or impossible to solve with traditional, SQL-oriented databases.
3. Traditional SQL database theory has a strong emphasis on ACID compliance. You probably remember that ACID stands for: Atomicity, Consistency, Isolation and Durability.
The Consistency property ensures that no database transaction violates the referential integrity rules defined in the database schema.
Isolation is a requirement that asserts that, given concurrent access to data, parallel operations cannot access data that is being modified by another transaction, but have to wait until that transaction completes. Isolation is commonly implemented with pessimistic locking.
The Isolation and Consistency requirements of ACID compliance constitute a fundamental problem for a system’s scalability.
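As a rough illustration of pessimistic locking, here is a toy Python sketch. It is not how any particular database implements isolation, and the `Account` class and `run_withdrawals` helper are names invented for this example. Every writer must acquire the same lock before it may touch the record, so concurrent operations simply queue up:

```python
import threading


class Account:
    """Toy record guarded by a pessimistic (exclusive) lock. Hypothetical example."""

    def __init__(self, balance):
        self.balance = balance
        self._lock = threading.Lock()

    def withdraw(self, amount):
        # Any other writer blocks on this line until the lock is released.
        with self._lock:
            if self.balance >= amount:
                self.balance -= amount
                return True
            return False


def run_withdrawals(account, amount, count):
    """Start `count` concurrent withdrawals and return the final balance."""
    threads = [
        threading.Thread(target=account.withdraw, args=(amount,))
        for _ in range(count)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return account.balance


if __name__ == "__main__":
    # 20 writers compete for a balance that only covers 10 withdrawals.
    print(run_withdrawals(Account(100), 10, 20))
```

The lock keeps the balance from going negative, but each additional writer waits for the one before it, which is exactly the cost that hurts scalability.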
4. To put it in the words of Werner Vogels, CTO of Amazon and one of the foremost experts in the field of distributed computing: “If you’re concerned about scalability, any algorithm that forces you to run agreement will eventually become your bottleneck. Take that as a given.”
ACID compliance is all about the various processes [and nodes] in the system checking with each other to keep data consistent across the entire system.
Therefore, the challenge is not so much how well master-slave or master-master replication is implemented in your database. The bigger challenge is the architectural constraint that ACID compliance imposes on scalability.
5. How important is scalability for a Web system? Is it something that matters just for Amazon, Facebook, Google and the like?
The Internet is an incredibly fast-growing medium. It took radio 38 years after its introduction to reach 50 million users and it took television 13 years. The Internet did it in just 4, and it has been growing exponentially ever since.
6. In a report published in June this year, Cisco forecast that global IP traffic will quadruple by 2015. It means: more users, a larger amount of content, more types of content, more sources of content and more real-time content. In this context, by “real-time content” I mean things like check-ins, coverage of live events and citizen journalism during breaking news.
Now, most of us in the content-production industry believe that having more traffic and more content is good news. Scratch that: it’s great news! As a matter of fact, the Internet community has gotten so obsessed with the amount of website traffic that it is often used as the most significant measure of a website’s success or failure.
So: more traffic is good news… unless you are the developer responsible for making sure the website is still up and running when traffic quadruples.
7. We started the scalability discussion by mentioning the scalability limitations that the ACID-compliance requirement imposes. This constraint is actually a specific case of a more generic theorem called Brewer’s Theorem, or the CAP Theorem.
The theorem was formulated as a conjecture by a UC Berkeley professor, Eric Brewer, in 2000. Two years later, Seth Gilbert and Nancy Lynch of MIT published a formal proof of Brewer’s conjecture.
The CAP Theorem states that, when designing distributed software systems, there are three properties that are commonly desired: (1) Consistency, (2) Availability and (3) Partition Tolerance. The theorem proves that it is impossible to achieve all three at the same time[1].
Even though the names sound intuitive, it is probably worthwhile to clarify what Gilbert and Lynch meant by each of the terms in CAP, since there are multiple confusing (and sometimes contradictory) definitions floating around the web.
8. Consistency basically stands for the requirement that all nodes in a distributed system must see the same data all the time (a subset of ACID compliance).
Availability means: every request should receive a response. The system as a whole should be highly available.
Partition Tolerance, in a distributed system, means the system should allow some fault tolerance. When some nodes crash or some communication links fail, it is important that the system still performs as expected.
9. Let’s look at some popular distributed data storage systems that you are probably familiar with and see which bucket they fall into in the CAP spectrum.
Relational databases, LDAP directory servers and xFS file systems are all examples of consistent and available distributed systems. They are consistent because they provide ACID compliance. They are not partition-tolerant because they do not have a quorum system for removing unreachable nodes from the system.
10. MongoDB, Terrastore, Redis and BigTable all guarantee consistency, and they use quorum for partition tolerance, but they forfeit Availability.
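To show what "using quorum" and "forfeiting Availability" can look like in practice, here is a deliberately simplified Python sketch. It does not describe the actual replication protocol of MongoDB, Terrastore, Redis or BigTable. The `quorum_write` function, the `Unavailable` exception and the dictionary-based replicas are assumptions made up for illustration. The write is accepted only when a strict majority of replicas can be reached. Otherwise the system refuses the request rather than risk inconsistent data:

```python
class Unavailable(Exception):
    """Raised when no majority of replicas is reachable (hypothetical error type)."""


def quorum_write(replicas, reachable, key, value):
    """Write `value` under `key` on every reachable replica, if they form a majority.

    replicas:  list of dicts, one per node
    reachable: set of indexes into `replicas` that this node can currently contact
    """
    if len(reachable) <= len(replicas) // 2:
        # Consistency wins over availability: refuse instead of diverging.
        raise Unavailable("no majority reachable; refusing the write")
    for index in reachable:
        replicas[index][key] = value
    return len(reachable)


if __name__ == "__main__":
    nodes = [{}, {}, {}]
    quorum_write(nodes, {0, 1}, "headline", "Budget passes")  # 2 of 3: accepted
    try:
        quorum_write(nodes, {2}, "headline", "Budget fails")  # 1 of 3: refused
    except Unavailable as error:
        print("rejected:", error)
```

During a network partition, the minority side of such a system stops accepting writes. That is the availability being given up.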
11. The Domain Name System (yep, the one that drives all internet traffic), CouchDB, Riak and Cassandra are all examples of Available and Partition-tolerant distributed systems. They do not guarantee consistency. Rather, they provide a promise of something known as “eventual consistency”. For any given request, you may receive a value that is stale system-wide and definitely not isolated per ACID-compliance requirements, but eventually all nodes will sync up.
Not running an agreement-based algorithm, as Amazon’s Werner Vogels was preaching, is exactly the sacrifice that systems like CouchDB and DNS make to provide extreme scalability and fault tolerance.
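The "stale now, in sync later" behavior can be sketched in a few lines of Python. This is a toy model, not CouchDB's or DNS's real replication: the `Replica` class, the `sync` function and the simple highest-version-wins rule are all assumptions chosen to keep the example short.

```python
class Replica:
    """One node holding its own copy of the data (hypothetical toy model)."""

    def __init__(self, name):
        self.name = name
        self.data = {}  # key -> (version, value)

    def write(self, key, value, version):
        # Writes are accepted locally, without asking any other node.
        self.data[key] = (version, value)

    def read(self, key):
        entry = self.data.get(key)
        return entry[1] if entry else None


def sync(a, b):
    """Exchange entries so that both replicas keep the highest version of each key."""
    for key in set(a.data) | set(b.data):
        newest = max(
            a.data.get(key, (0, None)),
            b.data.get(key, (0, None)),
            key=lambda entry: entry[0],
        )
        a.data[key] = newest
        b.data[key] = newest


if __name__ == "__main__":
    east, west = Replica("east"), Replica("west")
    east.write("headline", "Polls open", version=1)
    sync(east, west)
    east.write("headline", "Polls close", version=2)
    print(west.read("headline"))  # still the old headline: west is stale
    sync(east, west)
    print(west.read("headline"))  # now caught up with east
```

Between the two `sync` calls a reader on `west` sees the old value. Nobody waited for agreement, so both nodes stayed available the whole time.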
12. In his 2000 keynote at the ACM Symposium on Principles of Distributed Computing (the same one where he formulated the CAP theorem), Dr. Brewer also came up with a new definition he called BASE.
BASE stands for: Basically Available, Soft-state, Eventual consistency.
He formulated and used BASE principles to demonstrate the trade-offs and differences from ACID-compliant systems.
13. ACID-compliant systems have the following traits: consistency, isolation, focus on commit, nested transactions, pessimistic locking. Typically they are also fixed-schema-based and therefore inflexible to evolve.
14. In contrast, BASE systems exhibit: weak consistency, availability prioritized above all else, a best-effort approach to conflict resolution and optimistic locking. Systems with the BASE philosophy consider approximate responses to be OK, are architecturally simpler and faster, and evolve flexibly, since they are typically schema-less.
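Optimistic locking can be sketched as a revision check at write time. The Python below is a toy in-memory store, loosely modeled on the way CouchDB requires the current revision when updating a document and reports a conflict otherwise. The `DocumentStore` class, the integer revisions and the `Conflict` exception are simplifications invented for this example, not CouchDB's API.

```python
class Conflict(Exception):
    """Raised when a writer holds an outdated revision (hypothetical error type)."""


class DocumentStore:
    """In-memory store that detects conflicting updates instead of locking."""

    def __init__(self):
        self._docs = {}  # doc_id -> (revision, body)

    def get(self, doc_id):
        revision, body = self._docs[doc_id]
        return revision, dict(body)

    def put(self, doc_id, body, expected_rev=None):
        current = self._docs.get(doc_id)
        current_rev = current[0] if current else None
        if current_rev != expected_rev:
            # Somebody else updated the document since this writer read it.
            raise Conflict(f"{doc_id}: expected rev {expected_rev}, found {current_rev}")
        new_rev = (current_rev or 0) + 1
        self._docs[doc_id] = (new_rev, dict(body))
        return new_rev


if __name__ == "__main__":
    store = DocumentStore()
    store.put("story-1", {"title": "Draft"})
    rev, _ = store.get("story-1")
    store.put("story-1", {"title": "Editor A"}, expected_rev=rev)
    try:
        store.put("story-1", {"title": "Editor B"}, expected_rev=rev)
    except Conflict as error:
        print("conflict:", error)
```

Nobody waits on a lock here. The second editor finds out about the clash only at write time and has to re-read and retry, which is the best-effort conflict resolution mentioned above.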
15. CouchDB is not a “better MySQL” or a “simpler Oracle”. It is really good at availability and partition tolerance and has many traits that make it a better tool for some of the problems traditionally solved with relational databases. But one thing it is not: it is not a drop-in replacement for SQL databases.
There are tradeoffs when choosing a document database, and specifically CouchDB. The most obvious and honestly “scary” tradeoff is forfeiting Consistency. We as computer scientists were trained long and hard that data must be consistent, models must be normalized, referential integrity must be maintained, etc. How can we even dream about forfeiting consistency, even for scalability and fault tolerance?
16. The reality, however, is that there are systems engineering problems where strict data consistency is crucial, but there are many where it is not. If you are building stock-trading software, you should probably use data storage that guarantees consistency. Financial systems in general require a high level of consistency, but that is not a given for just any system.
Anybody who has built a real-life, high-throughput system knows that in many cases you end up de-normalizing the data model to allow for better performance. It is similar to forfeiting consistency in the CAP model.
With a document-based database like Couch, some of your requests may occasionally return slightly stale data. Additionally, data in document format is often highly de-normalized and less referentially consistent than data in a fully normalized, relational database. However, if you are building a news publishing website, none of this is unheard of. High-traffic news websites have been de-normalizing data and implementing aggressive caching for years. This is neither new nor radical. On the contrary, instead of home-cooked and half-baked proprietary solutions, we can now use a standard, open-source, highly optimized, well-tested solution like CouchDB. Personally, I think it’s a pretty good deal.
17. At this point, I’ve spent a good portion of this presentation explaining the scalability profile of CouchDB (and similar systems) and discussing how the improvements are not quantitative but fundamentally qualitative. We have also talked about the tradeoffs that the increased availability imposes.
Let’s forget about scalability for now, however, and talk about other characteristics of CouchDB as a document storage engine. After all, CouchDB is not the only document database and there are document databases that do guarantee data consistency, so forfeiting consistency is actually a trait of AP systems (in the CAP model), not of document databases in general.
An important trait of document databases, however, is that they are schema-less. There is no pre-defined, strict schema, no table structures or rigid relationships between document types. Document types live in a free world and evolve very flexibly.
18. OK, this is by far one of my ugliest slides. What you see here is a rough ER diagram generated off a fresh, vanilla installation of a popular open-source content management system: Drupal. There are 72 tables on this diagram.
Some of you may be familiar with Drupal. It is highly extensible (and generally really awesome), but it does not do much out of the box. So when we used Drupal for creating websites like those of The Nation or The New Republic, we installed dozens of additional Drupal modules and wrote a bunch on top ourselves. Meaning: we added even more tables. And you can clearly see how unreadable this schema already is.
Obviously we never even tried to visualize the entire data model on any real project, because it would have been useless.
19. The same data model in a document-based database would look like this: (see slide)
I know, I know! I am exaggerating. Obviously we would have more than one logical type of document even in a document database, but schema-less modeling means that at the physical level it is just one document type, so what you see here is really not that far from reality as far as actual data storage goes. Most things above and beyond that are really part of the application logic and business rules.
Since my presentation is one of the last ones at this conference, I am sure you have already listened to presenters who went into great detail about data modeling in CouchDB, and I am sure they are much bigger experts on the subject than I am. So I will spare you the experience.
Suffice it to say that embedding documents greatly simplifies data models. Think about just the number of so-called “mapping” tables that relational systems need to model things like many-to-many relationships.
Also, in the case of online publishing specifically, most business objects are… well, documents, so having a storage engine that operates in terms of documents is extremely natural and enjoyable. There’s much less discrepancy between physical and logical models. Things, in most cases, just make sense and fall in line naturally.
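To give a feel for what embedding buys you, here is a small Python sketch of a story stored as one self-contained document. The field names (`type`, `authors`, `tags`, `related`) and the helper `stories_with_tag` are made up for illustration and are not taken from NPR's or Drupal's actual models. In a relational schema the authors and tags would typically live in their own tables plus a mapping table each.

```python
import json

# One logical "story" as a single document: authors and tags are embedded,
# so no story_author or story_tag mapping tables are needed.
story = {
    "_id": "story-1",
    "type": "story",
    "title": "City council approves budget",
    "authors": [
        {"name": "A. Reporter", "role": "correspondent"},
        {"name": "B. Editor", "role": "editor"},
    ],
    "tags": ["politics", "economy"],
    "related": ["story-0"],  # references to other documents stay plain ids
}

# A different logical type can sit in the same database with different fields.
audio = {"_id": "audio-7", "type": "audio", "title": "Budget debate", "seconds": 312}


def stories_with_tag(docs, tag):
    """Return ids of story documents carrying `tag`; no joins involved."""
    return [
        doc["_id"]
        for doc in docs
        if doc.get("type") == "story" and tag in doc.get("tags", [])
    ]


if __name__ == "__main__":
    print(json.dumps(story, indent=2))
    print(stories_with_tag([story, audio], "economy"))
```

The whole story serializes to one JSON object, which is also the shape a page template or an API response usually wants.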
20. Another important, stark difference between relational databases and CouchDB is the absence of an ad hoc query language. Like most other things about CouchDB, it’s pretty “scary” for newcomers. So much so that some other document databases have actually opted to offer more SQL-like, ad hoc querying (MongoDB, for instance), and I know a lot of people who appreciate that.
In contrast, CouchDB uses Map/Reduce, first filtering the data with a Map function and then (optionally) grouping it with a Reduce function. The documents, the results of a map function and the results of a reduce function are all saved in a B-tree (the secret sauce of CouchDB’s performance). Whereas in a relational database you would have normalized data and would then index some columns from that data, most things in Couch are a B-tree index to begin with.
This has significant consequences and, much like in the case of forfeiting data consistency, there are some real trade-offs to be made. While Map/Reduce is very powerful, you will obviously find some queries that you could run in SQL that are either impossible to model with a View or are too expensive or too slow. Also, Views are not as dynamic as SQL queries. They are built incrementally, and a complete rebuild of one in a large database is an expensive operation. As such, it really pays off to carefully think through the Views that a system will be using at the early stages of the system design.
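The Map/Reduce idea behind Views can be sketched in plain Python. CouchDB views are normally written as JavaScript functions stored in design documents and are indexed in a B-tree. The code below is only a stand-in under that caveat: `map_story_by_tag`, `build_view` and `reduce_count` are hypothetical names, and a sorted list plays the role of the index.

```python
from itertools import groupby


def map_story_by_tag(doc):
    """Map step: emit one (key, value) row per tag of every story document."""
    if doc.get("type") == "story":
        for tag in doc.get("tags", []):
            yield tag, 1


def build_view(docs, map_fn):
    """Run the map function over all documents and keep the rows sorted by key."""
    rows = [row for doc in docs for row in map_fn(doc)]
    return sorted(rows, key=lambda row: row[0])


def reduce_count(rows):
    """Reduce step: sum the values of rows that share a key."""
    return {
        key: sum(value for _, value in group)
        for key, group in groupby(rows, key=lambda row: row[0])
    }


if __name__ == "__main__":
    docs = [
        {"_id": "s1", "type": "story", "tags": ["politics", "economy"]},
        {"_id": "s2", "type": "story", "tags": ["politics"]},
        {"_id": "a1", "type": "audio"},
    ]
    view = build_view(docs, map_story_by_tag)
    print(view)
    print(reduce_count(view))
```

Because the rows are precomputed and kept ordered by key, a "stories per tag" listing becomes an index lookup rather than a query planned at request time. That is also why changing the map function means rebuilding the view.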
21. The good news is: in online publishing, most user-facing content is a document, a listing of documents or an aggregation, which are exactly the things that document-based databases and CouchDB’s Views are highly optimized for.
As a matter of fact, at NPR, to withstand the millions of unique users that the main website gets, our legacy system uses an architecture with very similar constraints. It has content objects that are serialized XML, XML lists of content objects, and aggregations also represented in an XML format. While we do use an SQL database in the back-end, the front-end architecture has made many architectural decisions similar to those made in CouchDB.
Yes, the legacy system uses XML instead of JSON… I know, I know! But we have been running our systems for a long while, so some of it pre-dates the time when JSON got all sexy and trendy :)
22. To summarize, AP-style (as defined by the CAP model) document databases exhibit the following traits, important for online publishing systems that get significant traffic and have real-time content streams:
- High availability
- Partition Tolerance
- Schema-less architecture
- Document-oriented storage
- Index-based, semi-dynamic querying like that in CouchDB Views.
The benefit from each one of these features is the result of a tradeoff. For teams architecting systems and implementing document databases, it is crucial to understand and appreciate the tradeoffs made.
That said, document databases are disruptive and the benefits they provide are real. Ignoring them, and not augmenting traditional, relational storage systems with document-based ones, would be a mistake.