Unveil Historical Obituary Data With Web Scraping Solution

A web scraping solution to automate the process of collecting and processing historical obituary data across a large number of digital newspaper archives and open-source sites.

SERVICE OFFERS: Historical Data Web Scraping Solution

BUSINESS CHALLENGES

Our Client

DIGI-TEXX’s client is the world’s leading provider of family history and genealogy services, with its main office in the US. After 20 years of collecting, indexing, and digitizing, the company currently manages almost 7 billion records, including immigration, military service, and marriage records, and much more.

Project Challenges

Tracing The Historical Profiles

In genealogy research, obituaries are a treasure trove of crucial data. Beyond basic biographical details such as name and dates of birth and death, they provide insights into geographical locations, relatives’ names, and other details that may be difficult to find in other historical sources.

Nevertheless, collecting and processing obituary data remains a challenging task for our client due to several factors:

  • Handling diverse and enormous data sources: A vast amount of historical obituary data is scattered across thousands of digital resources, including newspaper, library, government, church, university, and funeral home websites, among others.
  • Manual inefficiency: Manually extracting and indexing data across different formats, data types, and complex web structures is time-consuming and costly.
  • Data quality assurance: Ensuring data accuracy, completeness, and consistency is difficult because of errors, unstructured data, and missing information.

Project Scope

The project aims to develop a robust solution that helps the client automatically collect historical obituary data across digital resources. The collected data is then standardized to ensure consistent, high-quality information.

  • Volume: 150,000 records per URL, at 20 URLs per month
  • Extracted fields in each record include a person’s name, gender, images, birthplace, age, residence, date of death, location, and cause of death, among others.
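
To make the scope concrete, the following is a minimal sketch of how one record and the monthly volume might be modelled. The class name `ObituaryRecord`, the field names, and the helper `monthly_volume` are illustrative assumptions, not the client's actual schema; only the two volume figures come from the scope above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ObituaryRecord:
    """One extracted obituary. Field names are hypothetical and simply
    mirror the fields listed in the project scope."""
    name: str
    gender: Optional[str] = None
    birthplace: Optional[str] = None
    age: Optional[int] = None
    residence: Optional[str] = None
    date_of_death: Optional[str] = None   # assumed ISO 8601 text, e.g. "YYYY-MM-DD"
    location: Optional[str] = None
    cause_of_death: Optional[str] = None
    image_urls: List[str] = field(default_factory=list)


# Figures taken from the project scope.
RECORDS_PER_URL = 150_000
URLS_PER_MONTH = 20


def monthly_volume(records_per_url: int = RECORDS_PER_URL,
                   urls_per_month: int = URLS_PER_MONTH) -> int:
    """Total records expected per month at the stated volume."""
    return records_per_url * urls_per_month
```

At the stated volume, `monthly_volume()` works out to 150,000 × 20 = 3,000,000 records per month. Every field except `name` is optional in this sketch because obituaries frequently omit details.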

SOLUTION

Historical Data Web Scraping Solution

To address these challenges, DIGI-TEXX developed a sophisticated web scraping solution to automate the process of collecting and processing historical obituary data across a large number of digital newspaper archives and open-source sites. This enhances the client’s database, giving its users access to millions of new records.

  1. Scraper Development: Our team integrated machine learning algorithms and natural language processing (NLP) to develop a robust web scraper capable of:
    • Accessing online newspaper archives (e.g., Newspapers.com, GenealogyBank, Legacy.com) and other websites, including those of schools, hospitals, churches, and much more.
    • Navigating through a vast array of sitemaps, sections, and search results.
    • Identifying relevant obituary pages based on keywords and layout patterns.
    • Collecting the necessary data fields from a targeted website, including the person’s name, birthplace, age, date of death, residence, cause of death, and images, among others.
    • Consolidating collected data into a central repository so that there is one record per unique individual.
    • Handling various data formats and structures (PDF, HTML, images).
    • Implementing filters and algorithms for error handling, retry mechanisms, and anti-scraping mitigation.
  2. Data Validation: The collected data was cleaned, standardized, and formatted to match the client’s database structure.
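
The steps above can be illustrated with a simplified sketch. It is not DIGI-TEXX's implementation: plain keyword matching stands in for the machine learning and NLP components, and every name here (`OBITUARY_KEYWORDS`, `looks_like_obituary`, `with_retries`, `consolidate`, `standardize_date`), the keyword list, the date formats, and the choice of name plus date of death as the deduplication key are assumptions made for illustration.

```python
import re
import time
from datetime import datetime
from typing import Callable, Dict, Iterable, List, Optional, Tuple, TypeVar

T = TypeVar("T")
Record = Dict[str, Optional[str]]

# Hypothetical keyword list; a real system would tune this per source.
OBITUARY_KEYWORDS = ("obituary", "passed away", "died", "survived by", "funeral")
# Hypothetical input formats accepted by standardize_date.
DATE_FORMATS = ("%Y-%m-%d", "%B %d, %Y", "%d %B %Y")


def looks_like_obituary(text: str, min_hits: int = 2) -> bool:
    """Flag a page when at least min_hits distinct keywords appear."""
    lowered = text.lower()
    return sum(1 for kw in OBITUARY_KEYWORDS if kw in lowered) >= min_hits


def with_retries(fetch: Callable[[], T], attempts: int = 3,
                 base_delay: float = 1.0) -> T:
    """Call fetch(), retrying with exponential backoff on any exception."""
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            time.sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")


def standardize_date(raw: str) -> Optional[str]:
    """Return the date as ISO 8601 text, or None if no format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def dedup_key(record: Record) -> Tuple[str, str]:
    """Assumed identity key: normalised name plus date of death."""
    name = re.sub(r"\s+", " ", record.get("name") or "").strip().lower()
    return name, record.get("date_of_death") or ""


def consolidate(records: Iterable[Record]) -> List[Record]:
    """Merge records sharing a key, keeping the first non-empty value per field."""
    merged: Dict[Tuple[str, str], Record] = {}
    for record in records:
        target = merged.setdefault(dedup_key(record), {})
        for field_name, value in record.items():
            if value and not target.get(field_name):
                target[field_name] = value
    return list(merged.values())
```

In this sketch, a page fetch would be wrapped as `with_retries(lambda: download(url))`, where `download` is whatever fetch function the scraper uses (hypothetical here). Name plus date of death is a deliberately naive identity key; distinguishing individuals who share both would need additional fields such as birthplace or residence.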

BUSINESS OUTCOME

  • Optimized data processing time:
    • 20-30 minutes for a text-based URL
    • 2-3 days for a more complex URL
  • Expanded Database: Delivered over 150,000 records per URL, significantly expanding the client’s database and providing users with access to millions of historical records.
  • Improved Data Quality: Achieved an accuracy rate of 95%, ensuring reliable and accurate obituary data.
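
The case study does not say how the accuracy rate was measured. One common approach, shown here purely as an assumption, is field-level accuracy against a manually reviewed sample; the function name `field_accuracy` and the record layout are hypothetical.

```python
from typing import Dict, List, Optional

Record = Dict[str, Optional[str]]


def field_accuracy(extracted: List[Record], reference: List[Record],
                   fields: List[str]) -> float:
    """Share of (record, field) pairs where the extracted value equals
    the manually reviewed reference value."""
    if len(extracted) != len(reference):
        raise ValueError("extracted and reference must be aligned one-to-one")
    total = len(extracted) * len(fields)
    if total == 0:
        raise ValueError("nothing to compare")
    correct = sum(
        1
        for got, want in zip(extracted, reference)
        for name in fields
        if got.get(name) == want.get(name)
    )
    return correct / total
```

Under this definition, a 95% rate would mean that 95 of every 100 checked field values match the reviewed sample.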
