Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past

The Future of The
Past
The New York Times and the Challenge of Archives
Evan Sandhaus,  
Sophia Van Valkenburg  
Jane Cotler 
The New York Times
@nytarchives

A Problem of Archives
“How do you faithfully represent
information created with one
technology using another?”

A Problem We Know Well
• Migrating The Index to The Times Information Bank
• Migrating The Microfilm Archive to TimesMachine
• Migrating Legacy Web Content to Modern Online
Presentation (or the challenge of multiple legacy
formats)

The Problem By The
Numbers
Almost 60,000
Issues Published Since
September 18, 1851

The Problem By The
Numbers
3,500,000+
Unique Pages Printed Since
September 18, 1851

The Problem By The
Numbers
15,000,000+
Articles Published Since
September 18, 1851

Digital Archives
[Timeline figure. Periods: 1851-1859, 1860-1865, 1866-1949, 1950-1959, 1960-1969, 1970-1980, 1981-1995, 1996-2016. Legend: Full Text NYT5, Full Text NYT4, Abstracts NYT4, Abstracts NYT5]

The New York Times
Information Bank

The New York Times Company Archives

The Deep Archive
[Chart. Horizontal axis: 1851 to 2012. Vertical axis: 0 to 180,000. Series: Scanned Articles, Digital Articles, Blogs. Annotations: ≈75%, ≈25%]

The Numbers
46,592
Issues Published Since
September 18, 1851

The Numbers
2,335,446
Unique Pages Printed Since
September 18, 1851

The Numbers
11,298,320
Articles Published Since
September 18, 1851

The Scanned Archive
Headline
CROWD ROARS THUNDEROUS WELCOME;
Breaks Through Lines of Soldiers and Police and
Surging to Plane Lifts Weary Flier from His Cockpit
AVIATORS SAVE HIM FROM FRENZIED MOB OF
100,000 Paris Boulevards Ring With Celebration
After Day and Night Watch -- American Flag Is
Called For and Wildly Acclaimed

The Scanned Archive
Lede Paragraph
PARIS, May 21. -- Lindbergh did it. Twenty minutes
after 10 o'clock tonight suddenly and softly there
slipped out of the darkness a gray-white airplane as
25,000 pairs of eyes strained toward it. At 10:24 the
Spirit of St. Louis landed and lines of soldiers, ranks
of policemen and stout steel fences went down
before a mad rush as irresistible as the tides of the
ocean.

The Scanned Archive
“Dirty” ASCII
…Lifte Fro'm His Cockpit. As he was lifted to the
ground Lindbergh w as l,-:k:, :::. - hair unkempt, he
looked completely worn out. lle h-:: strength
enough, however, to smile, and waved his hand to
t? ' crowd. Soldiers with ﬁxed bayonets were unable
to keep bach the crowd. United States Ambassador
Herrick was among the ﬁrst to welcome and
congratulate the hero.s…

The Scanned Archive
Indexing Metadata
Headings
People, Places, Organizations, Subject
Abstracts
Concise summary of the facts in the article

The Problem
• As a subscriber exclusive, TimesMachine does not
appear in Google Search results.
• Lack of full text before 1980 makes it difficult for
articles to rank, or even appear, in Google results.
• For example: In 1945 The Times published 161,961
articles, and only a tiny fraction appear in Google
results.

The Solution
• Transcribe articles from archival scans and publish
these assets as searchable pages on nytimes.com.
• Transcribe and publish 1964 as a pilot.
• If that works, transcribe and publish all remaining
articles between 1960-1980.

Progress & Results
• All articles between 1960-1980 transcribed.
• All articles between 1970-1979 available on
nytimes.com with more to come.
• Google now indexing 672,500 new assets published
between 1970-1979!
• Plans to publish 1960-1969, and to monitor
performance of new pages.

Archival Content on
NYTimes.com

The Initial Solution
new format for CMS (JSON)
print data (XML)

The Case Of The Missing
Articles

Articles
web data (HTML)
new format for CMS (JSON)
print data (XML)

The Case of the Missing
Articles
1. What is the complete list of article URLs from
1996-2006?
2. How do we identify which of the missing web
articles correspond to existing print articles so that
we can combine them and avoid duplicate content?
3. Which articles are web-only and not in our print
archive at all, and how do we scrape that page for
content & metadata?
4. Can we build a system that will process all the data
for each year easily & efficiently?
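
Question 2 above can be illustrated with a small sketch. It assumes headlines are available for both the web and the print version of an article, and it uses fuzzy string similarity from Python's standard library as the matching signal. That technique, the function names, and the threshold are assumptions made for illustration; the slides do not say how the matching was actually done.

```python
from difflib import SequenceMatcher


def normalize(headline: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not matter."""
    return " ".join(headline.lower().split())


def best_print_match(web_headline, print_headlines, threshold=0.85):
    """Return the index of the most similar print headline, or None.

    `threshold` is a hypothetical cutoff chosen for illustration only.
    """
    best_index, best_score = None, 0.0
    target = normalize(web_headline)
    for index, candidate in enumerate(print_headlines):
        score = SequenceMatcher(None, target, normalize(candidate)).ratio()
        if score > best_score:
            best_index, best_score = index, score
    return best_index if best_score >= threshold else None
```

A web article with no print headline above the threshold would be treated as a web-only candidate (question 3).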

The Definitive List of Articles
4 different sources:
1. Print archive
2. Site analytics (from the past 6 months)
3. Movie, theater, and restaurant reviews
4. Sitemaps
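
Combining the four sources above into one list could look like the sketch below. It assumes each source has already been reduced to a collection of URL strings, and that two URLs refer to the same article when host and path agree; both the canonicalization rule and the function names are hypothetical.

```python
from urllib.parse import urlsplit


def canonical(url: str) -> str:
    """Reduce a URL to lowercase host plus path, dropping scheme, query, and fragment."""
    parts = urlsplit(url.strip())
    return parts.netloc.lower() + parts.path


def definitive_list(*sources):
    """Union any number of URL collections into one sorted, de-duplicated list."""
    merged = set()
    for source in sources:
        # Skip blank entries, which analytics exports and sitemaps may contain.
        merged.update(canonical(u) for u in source if u.strip())
    return sorted(merged)
```

Usage: `definitive_list(print_urls, analytics_urls, review_urls, sitemap_urls)`, where the four argument names stand in for the four sources listed above.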

The Archive Migration
Pipeline For A Given Year
archive XML
definitive list of URLs
extracted URLs
missing URLs
missing HTML
URLs with no article body
XML to HTML matches
unmatched HTML
JSON from XML and HTML
JSON from unmatched HTML
skipped files
JSON with no duplicate
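
Two of the stages named above, "missing URLs" and "URLs with no article body", can be sketched as simple set and dictionary operations. This assumes the definitive list and the URLs extracted from the archive XML are plain collections of strings, and that scraped pages are held as a URL-to-body-text mapping; the function names are hypothetical and the real pipeline may differ.

```python
def missing_urls(definitive, extracted):
    """URLs on the definitive list that the archive XML did not yield."""
    return sorted(set(definitive) - set(extracted))


def split_by_body(pages):
    """Separate fetched pages into those with an article body and those without.

    `pages` maps URL -> extracted body text (empty string when none was found).
    """
    with_body = {url: body for url, body in pages.items() if body.strip()}
    no_body = sorted(url for url, body in pages.items() if not body.strip())
    return with_body, no_body
```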

The Archive Migration
Pipeline
2004 Articles (116K total)
Print Archive (56K): 48.3%
Print Archive and Web (42K): 36.2%
Web-only (15K): 12.9%
Bad urls (3K): 3%
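
The percentages for 2004 follow from the rounded counts on the slide. The sketch below redoes the arithmetic using only those figures, in thousands.

```python
# Counts in thousands, as shown on the slide for 2004.
counts_in_thousands = {
    "Print Archive": 56,
    "Print Archive and Web": 42,
    "Web-only": 15,
    "Bad urls": 3,
}

total = sum(counts_in_thousands.values())  # 116, matching "116K total"

# Share of each category as a percentage, rounded to one decimal place.
shares = {
    label: round(100 * count / total, 1)
    for label, count in counts_in_thousands.items()
}

for label, share in shares.items():
    print(f"{label}: {share}%")
```

The first three shares come out to 48.3, 36.2, and 12.9. The last is 2.6 when computed from the rounded 3K count, which the slide appears to show as 3%.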

All The Little Things…
• 1996
• Article Matching
• Better URLs
• Quality Assurance
• Next Steps

Article Matching: Fusion
archive XML
definitive list of URLs
extracted URLs
missing URLs
missing HTML
URLs with no article body
XML to HTML matches
unmatched HTML
JSON from XML and HTML
JSON from unmatched HTML
skipped files
JSON with no duplicate

Fusion Explained
web data (HTML)
print data (XML)
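
A minimal sketch of the fusion idea, assuming each matched article has already been parsed into a flat dictionary from its print XML and another from its web HTML. The field names, the `sources` key, and the precedence rule (print values win when both sides are non-empty) are all assumptions for illustration; the slides do not specify how conflicts were resolved.

```python
import json


def fuse(print_record: dict, web_record: dict) -> str:
    """Merge one print record and one web record into a single JSON document.

    Assumed policy: web fields fill in anything the print record lacks,
    and print values win when both sides supply a non-empty value.
    """
    fused = dict(web_record)
    fused.update({k: v for k, v in print_record.items() if v not in (None, "")})
    fused["sources"] = ["print", "web"]  # hypothetical provenance field
    return json.dumps(fused, sort_keys=True)
```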

Search Engine Optimization
27iht-scoutus.t.html

Search Engine Optimization
curb-violates-free-speech-supreme-court-rules-72-justices-void-internet.html
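
The contrast between the two URLs above is between an opaque slug and one built from headline words. A sketch of headline-based slug generation follows. The "72" in the slide's URL suggests punctuation was dropped rather than replaced, so the sketch does the same; that rule, the function name, and the word limit are assumptions, not the team's documented scheme.

```python
import re


def slugify(headline: str, max_words: int = 12) -> str:
    """Build a readable .html slug from a headline."""
    # Drop every character that is not a lowercase letter, digit, or whitespace,
    # so "7-2" becomes "72" and commas or semicolons vanish.
    cleaned = re.sub(r"[^a-z0-9\s]", "", headline.lower())
    # Keep at most max_words words and join them with hyphens.
    words = cleaned.split()[:max_words]
    return "-".join(words) + ".html"
```

With a made-up headline, `slugify("Supreme Court Rules, 7-2")` returns `supreme-court-rules-72.html`. One side effect of this rule is that hyphenated words are run together.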

Sections

Next Steps
[Timeline figure. Periods: 1851-1859, 1860-1865, 1866-1949, 1950-1959, 1960-1969, 1970-1980, 1981-1995, 1996-2016. Legend: Full Text, No Full Text]

Next Steps
Digital preservation

Thank You!
Evan Sandhaus, Sophia Van Valkenburg, Jane
Cotler
The New York Times
