As an executive director of technology at The New York Times, Evan Sandhaus leads the teams responsible for searching, displaying, organizing and delivering the 15 million articles that constitute The Times’ 163-year-old archive. In more than a decade with The Times, Sandhaus has created a new TimesMachine, directed The Times Linked Open Data initiative and collaborated with major search companies on schema.org. Sandhaus represents The Times at the International Press Telecommunications Council and serves on its board of directors. Originally from Kansas, he holds degrees in computer science from both Williams College and Villanova University.
Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past
1. The Future of The Past
The New York Times and the Challenge of Archives
Evan Sandhaus
Sophia Van Valkenburg
Jane Cotler
The New York Times
@nytarchives
4. A Problem of Archives
“How do you faithfully represent
information created with one
technology using another?”
5. A Problem We Know Well
• Migrating The Index to The Times Information Bank
• Migrating The Microfilm Archive to TimesMachine
• Migrating Legacy Web Content to Modern Online
Presentation (or the challenge of multiple legacy
formats)
6. The Problem By The Numbers
Almost 60,000
Issues Published Since
September 18, 1851
7. The Problem By The Numbers
3,500,000+
Unique Pages Printed Since
September 18, 1851
8. The Problem By The Numbers
15,000,000+
Articles Published Since
September 18, 1851
24. The Scanned Archive
Headline
CROWD ROARS THUNDEROUS WELCOME;
Breaks Through Lines of Soldiers and Police and
Surging to Plane Lifts Weary Flier from His Cockpit
AVIATORS SAVE HIM FROM FRENZIED MOB OF
100,000 Paris Boulevards Ring With Celebration
After Day and Night Watch -- American Flag Is
Called For and Wildly Acclaimed
25. The Scanned Archive
Lede Paragraph
PARIS, May 21. -- Lindbergh did it. Twenty minutes
after 10 o'clock tonight suddenly and softly there
slipped out of the darkness a gray-white airplane as
25,000 pairs of eyes strained toward it. At 10:24 the
Spirit of St. Louis landed and lines of soldiers, ranks
of policemen and stout steel fences went down
before a mad rush as irresistible as the tides of the
ocean.
26. The Scanned Archive
“Dirty” ASCII
…Lifte Fro'm His Cockpit. As he was lifted to the
ground Lindbergh w as l,-:k:, :::. - hair unkempt, he
looked completely worn out. lle h-:: strength
enough, however, to smile, and waved his hand to
t? ' crowd. Soldiers with fixed bayonets were unable
to keep bach the crowd. United States Ambassador
Herrick was among the first to welcome and
congratulate the hero.s…
27. The Scanned Archive
Indexing Metadata
Headings
People, Places, Organizations, Subject
Abstracts
Concise summary of the facts in the article
30. The Problem
• As a subscriber exclusive, TimesMachine does not
appear in Google Search results.
• Lack of full text before 1980 makes it difficult for
articles to rank, or even appear, in Google results.
• For example: In 1945 The Times published 161,961
articles and only a tiny fraction appear in Google
results.
31. The Solution
• Transcribe articles from archival scans and publish
these assets as searchable pages on nytimes.com.
• Transcribe and publish 1964 as a pilot.
• If that works, transcribe and publish all remaining
articles between 1960 and 1980.
32. Progress & Results
• All articles between 1960-1980 transcribed.
• All articles between 1970-1979 available on
nytimes.com with more to come.
• Google now indexing 672,500 new assets published
between 1970-1979!
• Plans to publish 1960-1969, and to monitor
performance of new pages.
39. The Case Of The Missing Articles
[Diagram: web data (HTML) + print data (XML) → new format for CMS (JSON)]
40. The Case of the Missing Articles
1. What is the complete list of article URLs from
1996-2006?
2. How do we identify which of the missing web
articles correspond to existing print articles so that
we can combine them and avoid duplicate content?
3. Which articles are web-only and not in our print
archive at all, and how do we scrape that page for
content & metadata?
4. Can we build a system that will process all the data
for each year easily & efficiently?
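Question 2 can be illustrated with a small sketch. The slides do not say how matching was actually done, so everything here is an assumption for illustration: the record fields (`headline`, `date`, `id`), the helper names, the same-day restriction, and the 0.9 similarity threshold are all hypothetical.

```python
import re
from difflib import SequenceMatcher


def normalize(headline: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", headline.lower())
    return " ".join(cleaned.split())


def best_print_match(web_article, print_articles, threshold=0.9):
    """Return the same-day print article with the most similar headline.

    Returns None when no candidate reaches the (assumed) threshold,
    which would mark the web article as a web-only candidate.
    """
    best, best_score = None, threshold
    target = normalize(web_article["headline"])
    for candidate in print_articles:
        # Assumption: a web article and its print twin share a publication date.
        if candidate["date"] != web_article["date"]:
            continue
        score = SequenceMatcher(
            None, target, normalize(candidate["headline"])
        ).ratio()
        if score >= best_score:
            best, best_score = candidate, score
    return best
```

A web article for which `best_print_match` returns `None` would fall into question 3 (web-only content); one with a match would be combined with its print record rather than published twice.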
41. The Definitive List of Articles
4 different sources:
1. Print archive
2. Site analytics (from the past 6 months)
3. Movie, theater, and restaurant reviews
4. Sitemaps
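Combining the four sources amounts to taking a union of URL sets after reducing each URL to a comparable form. The sketch below is a minimal illustration under stated assumptions: the canonicalization rules (drop scheme, `www.` prefix, query string, and trailing slash) and the function names are hypothetical, not the team's actual rules.

```python
from urllib.parse import urlsplit


def canonical_url(url: str) -> str:
    """Reduce a URL to host + path so the same article compares equal
    regardless of which source reported it (assumed rules)."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path.rstrip("/")
    return f"{host}{path}"


def definitive_list(*sources):
    """Union any number of URL collections into one de-duplicated set."""
    urls = set()
    for source in sources:
        urls.update(canonical_url(u) for u in source if u.strip())
    return urls
```

Called as `definitive_list(print_archive, analytics, reviews, sitemaps)`, this would yield one set in which an article reported by several sources appears only once.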
42. The Archive Migration Pipeline For A Given Year
[Pipeline diagram. Stages shown: archive XML; definitive list of URLs; extracted URLs; missing URLs; missing HTML; URLs with no article body; XML to HTML matches; unmatched HTML; JSON from XML and HTML; JSON from unmatched HTML; skipped files; JSON with no duplicate]
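The early stages named in the pipeline diagram suggest a set difference followed by a filter: URLs in the definitive list that were not extracted from the archive XML are "missing", and fetched pages without an article body are set aside. This is a sketch of that reading only; the function names and the in-memory `pages` mapping are assumptions, and fetching the HTML is left out.

```python
def missing_urls(definitive, extracted):
    """URLs in the definitive list that the archive XML did not yield."""
    return sorted(set(definitive) - set(extracted))


def split_by_body(pages):
    """Separate fetched pages into those with and without an article body.

    `pages` maps URL -> extracted body text (hypothetical structure).
    Returns (pages_with_body, urls_with_no_body).
    """
    with_body, no_body = {}, []
    for url, body in pages.items():
        if body and body.strip():
            with_body[url] = body
        else:
            no_body.append(url)
    return with_body, no_body
```

Pages that keep a body would move on to the matching stage; the URLs with no article body would be recorded separately for review.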
49. All The Little Things…
• 1996
• Article Matching
• Better URLs
• Quality Assurance
• Next Steps
50. Article Matching: Fusion
[Pipeline diagram. Stages shown: archive XML; definitive list of URLs; extracted URLs; missing URLs; missing HTML; URLs with no article body; XML to HTML matches; unmatched HTML; JSON from XML and HTML; JSON from unmatched HTML; skipped files; JSON with no duplicate]