Online Historical Obituary Data Collection With Web Scraping Solution

A web scraping solution to automate the process of collecting and processing historical obituary data across a large number of public digital newspaper archives and open-source sites.

SERVICE OFFERS: Historical Data Web Scraping Solution

BUSINESS CHALLENGES

Our Client

DIGI-TEXX’s client is the world’s leading provider of family history and genealogy services, with its main office located in the US. After 20 years of collecting, indexing, and digitizing, the company currently manages almost 7 billion records. These include immigration, military service, and marriage records, and much more.

Project Challenges

Tracing The Historical Profiles

In genealogy research, obituaries are a treasure trove of crucial data. Beyond basic biographical details such as name, birth date, and death date, they provide insights into geographical locations, relatives’ names, and other important information that may be difficult to find in other historical sources.

Unveil Historical Obituary Data With Web Scraping Solution

Nevertheless, collecting and processing obituary data remained a challenging task for our client due to several factors:

  • Diverse, high-volume data sources: A vast amount of historical obituary data is scattered across millions of digital resources, including public newspapers, libraries, governments, churches, universities, and funeral home websites, among others.
  • Manual inefficiency: Manually extracting and indexing data across different formats, data types, and complex web structures is time-consuming and costly.
  • Data duplication: The same piece of information can appear in several resources, so cleansing the data takes considerable time and labor.
  • Data quality assurance: Ensuring data accuracy, completeness, and consistency is difficult because of errors, unstructured data, and missing information.
  • Unstructured free text: Obituaries are written in a narrative, free-form format; our solution had to move beyond basic scraping to ‘understand’ the prose.
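
The duplication challenge above can be illustrated with a minimal Python sketch. It is not DIGI-TEXX's actual method: it assumes records are plain dictionaries with hypothetical `name`, `date_of_death`, and `source` keys, and it treats two records as duplicates when their normalized name and date of death match.

```python
import re
import unicodedata


def normalize_name(name: str) -> str:
    """Lowercase, strip accents and punctuation, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    letters_only = re.sub(r"[^a-z\s]", " ", without_accents.lower())
    return " ".join(letters_only.split())


def dedupe(records: list[dict]) -> list[dict]:
    """Keep the first record seen for each (name, date_of_death) key."""
    seen: set[tuple[str, str]] = set()
    unique = []
    for record in records:
        key = (normalize_name(record["name"]), record["date_of_death"])
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


# Hypothetical sample data: the first two entries describe the same person.
records = [
    {"name": "José  Martínez", "date_of_death": "1918-10-12", "source": "newspaper"},
    {"name": "Jose Martinez.", "date_of_death": "1918-10-12", "source": "funeral home"},
    {"name": "Jose Martinez", "date_of_death": "1921-03-02", "source": "church"},
]
unique_records = dedupe(records)
```

Exact-key matching like this only catches spelling and punctuation variants; a production system would likely need fuzzy matching and a rule for merging fields from duplicate sources.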

Project Scope

The project aimed to develop a robust solution that helps the client automatically collect historical obituary data across digital resources. The collected data is then standardized to ensure consistent, high-quality information.

  • Volume: 450,000 records per URL with 60 URLs/month
  • Extracted fields in each record include a person’s name, gender, images, birthplace, age, residence, date of death, location, cause of death (due to COVID-19, death during wartime, diseases, etc.)
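
To make the scope concrete, the sketch below models one record and works out the monthly volume implied by the two figures above. The class name, field names, and types are assumptions for illustration; the client's real schema is not described in this case study.

```python
from dataclasses import dataclass, field
from typing import Optional

# Figures taken from the project scope above.
RECORDS_PER_URL = 450_000
URLS_PER_MONTH = 60


@dataclass
class ObituaryRecord:
    """Hypothetical record layout covering the extracted fields listed above."""
    name: str
    gender: Optional[str] = None
    image_urls: list[str] = field(default_factory=list)
    birthplace: Optional[str] = None
    age: Optional[int] = None
    residence: Optional[str] = None
    date_of_death: Optional[str] = None  # ISO 8601 string, e.g. "1957-03-03"
    location: Optional[str] = None
    cause_of_death: Optional[str] = None


# Monthly volume if every URL yields the stated number of records.
monthly_records = RECORDS_PER_URL * URLS_PER_MONTH
print(f"{monthly_records:,} records per month")  # 27,000,000 records per month
```

Most fields are optional because obituaries frequently omit details such as cause of death or birthplace.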

SOLUTION

Historical Data Web Scraping Solution

To address these challenges, DIGI-TEXX developed a sophisticated web scraping solution to automate the process of collecting and processing historical obituary data across a large number of digital public newspaper archives and open-source sites. The goal was to enrich the client’s database and give its users access to millions of new records.

  1. Automated Web Scraper: This component handles site navigation and raw data collection from diverse online sources.
    • The engine crawls and indexes various platforms, including digital newspapers, archival websites, and public records, regardless of site structure.
    • It automatically identifies and retrieves files in multiple formats (HTML, PDF, and image files) so they can be processed in a unified environment.
  2. Natural Language Processing (NLP): Obituaries are primarily composed of unstructured free text; we deployed advanced NLP models to read and interpret the narrative content.
    • Our solution uses semantic parsing to understand the prose of a life story, accurately distinguishing between the deceased and surviving relatives.
    • Transforms narrative paragraphs into clean, categorized data fields (e.g., Cause of Death, Occupation, Education), enabling the conversion of biographical stories into searchable, actionable databases.
  3. Data Validation: The collected data was cleaned, standardized, and formatted to match the client’s database structure.
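
Steps 2 and 3 above can be sketched in a few lines of Python. This is a deliberately simplified stand-in: it uses regular expressions rather than the NLP models the case study describes, and the patterns, function names, and sample obituary are all hypothetical. It assumes the deceased's name leads the text and that dates are written in English, such as "March 3, 1957".

```python
import re
from datetime import datetime
from typing import Optional

# Hypothetical patterns; real obituaries vary far more than these allow.
AGE_RE = re.compile(r"\baged?\s+(\d{1,3})\b", re.IGNORECASE)
DEATH_RE = re.compile(r"(?:passed away|died) on ([A-Z][a-z]+ \d{1,2}, \d{4})")
SURVIVOR_RE = re.compile(
    r"survived by (?:her|his|their) (\w+) ([A-Z][a-z]+(?: [A-Z][a-z]+)*)"
)


def parse_date(text: str) -> Optional[str]:
    """Standardize a date such as 'March 3, 1957' to ISO 8601, or return None."""
    try:
        return datetime.strptime(text, "%B %d, %Y").date().isoformat()
    except ValueError:
        return None


def extract_fields(obituary: str) -> dict:
    """Pull a few fields from free text, keeping relatives separate from the deceased."""
    age_match = AGE_RE.search(obituary)
    death_match = DEATH_RE.search(obituary)
    survivor_match = SURVIVOR_RE.search(obituary)

    age = int(age_match.group(1)) if age_match else None
    relatives = []
    if survivor_match:
        relatives.append(
            {"relation": survivor_match.group(1), "name": survivor_match.group(2)}
        )
    return {
        # Assumption: the deceased's name is the text before the first comma.
        "name": obituary.split(",", 1)[0].strip(),
        # Validation: discard implausible ages instead of storing them.
        "age": age if age is not None and 0 <= age <= 130 else None,
        "date_of_death": parse_date(death_match.group(1)) if death_match else None,
        "relatives": relatives,
    }


SAMPLE = (
    "Mary Ann Smith, age 82, of Springfield, passed away on March 3, 1957. "
    "She is survived by her son John Smith."
)
record = extract_fields(SAMPLE)
```

For the sample text, `record` holds the deceased's name, an age of 82, the date of death as `1957-03-03`, and John Smith listed as a relative rather than as the subject of the record.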

BUSINESS OUTCOME

  • Optimized data processing time:
    • 20-30 minutes per text-based URL
    • 2-3 days per more complex URL
  • Expanded Database: Delivered over 450,000 records per URL, significantly expanding the client’s database and providing users with access to millions of historical records.
  • Improved Data Quality And Accuracy: Achieved an accuracy rate of 95%, ensuring reliable and accurate obituary data.
  • Support For Machine Learning And AI Applications: The solution produces large datasets that can be used to train machine learning models and improve their accuracy and performance.
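
The case study does not say how the 95% accuracy rate was measured. One common approach, sketched below as an assumption, is field-level accuracy: compare each extracted field against a manually verified reference record and report the share that match. The function name and sample data are hypothetical.

```python
def field_accuracy(
    extracted: list[dict], reference: list[dict], fields: list[str]
) -> float:
    """Share of fields whose extracted value equals the verified reference value."""
    total = 0
    correct = 0
    for got, expected in zip(extracted, reference):
        for name in fields:
            total += 1
            if got.get(name) == expected.get(name):
                correct += 1
    return correct / total if total else 0.0


# Hypothetical example: three of the four compared fields match.
extracted = [
    {"name": "Mary Ann Smith", "age": 82},
    {"name": "John Doe", "age": 47},
]
reference = [
    {"name": "Mary Ann Smith", "age": 82},
    {"name": "John Doe", "age": 74},
]
accuracy = field_accuracy(extracted, reference, ["name", "age"])
print(f"{accuracy:.0%}")  # 75%
```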
