{"id":23436,"date":"2024-09-13T10:36:01","date_gmt":"2024-09-13T03:36:01","guid":{"rendered":"https:\/\/digi-texx.com\/?post_type=case-studies&#038;p=23436"},"modified":"2026-05-21T10:32:58","modified_gmt":"2026-05-21T03:32:58","slug":"historical-obituary-data-with-web-scraping-solution","status":"publish","type":"case-studies","link":"https:\/\/digi-texx.com\/ja\/case-studies\/historical-obituary-data-with-web-scraping-solution\/","title":{"rendered":"Online Historical Obituary Data Collection With Web Scraping Solution"},"content":{"rendered":"<div class=\"gb-container gb-container-049d4be1\"><div class=\"gb-inside-container\">\n\n<h2 class=\"gb-headline gb-headline-9ac0d6d3 gb-headline-text\"><span class=\"ez-toc-section\" id=\"BUSINESS_CHALLENGES\"><\/span>BUSINESS CHALLENGES<span class=\"ez-toc-section-end\"><\/span><\/h2>\n\n\n\n<h3 class=\"gb-headline gb-headline-2e78daf4 gb-headline-text\"><span class=\"ez-toc-section\" id=\"Our_Client\"><\/span><strong><strong><strong><span style=\"color: var(--accent);\" class=\"stk-highlight\"><strong>Our Client<\/strong><\/span><\/strong><\/strong><\/strong><span class=\"ez-toc-section-end\"><\/span><\/h3>\n\n\n\n<p>DIGI-TEXX\u2019s client is the world&#8217;s leading provider of family history and genealogy services, with its main office located in the US. As part of the company&#8217;s 20 years of collecting, indexing, and digitizing efforts, they currently manage almost 7 billion records. 
These records include immigration, military service, marriages, and much more.<\/p>\n\n\n\n<h3 class=\"gb-headline gb-headline-fe55f590 gb-headline-text\"><span class=\"ez-toc-section\" id=\"Project_Challenges\"><\/span><strong><strong><strong><span style=\"color: var(--accent);\" class=\"stk-highlight\"><strong>Project Challenges<\/strong><\/span><\/strong><\/strong><\/strong><span class=\"ez-toc-section-end\"><\/span><\/h3>\n\n\n\n<p><strong><em><strong><em><span style=\"color: var(--accent);\" class=\"stk-highlight\">Tracing The Historical Profiles<\/span><\/em><\/strong><\/em><\/strong><\/p>\n\n\n\n<p>For genealogy research, obituaries are a treasure trove of crucial data. Beyond basic biographical details such as a person&#8217;s name and dates of birth and death, they provide insights into geographical locations, relatives&#8217; names, and other important data that may be difficult to find in other historical sources.<\/p>\n\n\n<style>.kb-image23436_597401-cd .kb-image-has-overlay:after{opacity:0.3;}<\/style>\n<div class=\"wp-block-kadence-image kb-image23436_597401-cd\"><figure class=\"aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"740\" height=\"416\" src=\"https:\/\/digi-texx.com\/wp-content\/uploads\/2024\/09\/01.-Web-Scraping.jpg\" alt=\"Unveil Historical Obituary Data With Web Scraping Solution\" class=\"kb-img wp-image-23441\" title=\"\" srcset=\"https:\/\/digi-texx.com\/wp-content\/uploads\/2024\/09\/01.-Web-Scraping.jpg 740w, https:\/\/digi-texx.com\/wp-content\/uploads\/2024\/09\/01.-Web-Scraping-300x169.jpg 300w\" sizes=\"auto, (max-width: 740px) 100vw, 740px\" \/><\/figure><\/div>\n\n\n\n<p>Nevertheless, collecting and processing obituary data remains a challenging task for our client due to several factors:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong><span style=\"color: var(--accent);\" class=\"stk-highlight\">Handling diverse and enormous data sources:<\/span> <\/strong>A vast amount of historical obituary data is 
scattered across millions of digital resources from public newspapers, libraries, governments, churches, universities, and funeral home websites.<\/li>\n\n\n\n<li><strong><span style=\"color: var(--accent);\" class=\"stk-highlight\">Manual Inefficiency:<\/span> <\/strong>Manually extracting and indexing different formats, data types, and complex web structures is time-consuming and costly.<\/li>\n\n\n\n<li><strong><span style=\"color: var(--accent);\" class=\"stk-highlight\">Data Duplication<\/span><\/strong>: The same piece of information can appear in multiple sources; as a result, cleansing this data takes considerable time and labor.<\/li>\n\n\n\n<li><strong><span style=\"color: var(--accent);\" class=\"stk-highlight\">Data quality assurance:<\/span> <\/strong>Ensuring data accuracy, completeness, and consistency can be challenging due to errors, unstructured data, and missing information.<\/li>\n\n\n\n<li><strong><span style=\"color: var(--accent);\" class=\"stk-highlight\">Unstructured free text<\/span><\/strong>: Obituaries are written in a narrative, free-form format; our solution had to move beyond basic scraping to &#8216;understand&#8217; the prose.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"gb-headline gb-headline-25fbbbd3 gb-headline-text\"><span class=\"ez-toc-section\" id=\"Project_Scope\"><\/span><strong><strong><strong><span style=\"color: var(--accent);\" class=\"stk-highlight\">Project Scope<\/span><\/strong><\/strong><\/strong><span class=\"ez-toc-section-end\"><\/span><\/h3>\n\n\n\n<p>The project aims to develop a robust solution that helps the client automatically collect historical obituary data across digital resources. 
Then, the collected data will be standardized to ensure consistency and quality of information.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Volume: 450,000 records per URL with 60 URLs\/month<\/li>\n\n\n\n<li>Extracted fields in each record include a person&#8217;s name, gender, images, birthplace, age, residence, date of death, location, cause of death (due to COVID-19, death during wartime, diseases, etc.)<\/li>\n<\/ul>\n\n<\/div><\/div>\n\n<div class=\"gb-container gb-container-540b5898\"><div class=\"gb-inside-container\">\n\n<h2 class=\"gb-headline gb-headline-c2b72c8c gb-headline-text\"><span class=\"ez-toc-section\" id=\"SOLUTION\"><\/span><strong><strong>SOLUTION<\/strong><\/strong><span class=\"ez-toc-section-end\"><\/span><\/h2>\n\n\n\n<h3 class=\"gb-headline gb-headline-91203dbc gb-headline-text\"><span class=\"ez-toc-section\" id=\"Historical_Data_Web_Scraping_Solution\"><\/span><strong><strong><strong><span style=\"color: var(--accent);\" class=\"stk-highlight\"><span style=\"color: var(--accent);\" class=\"stk-highlight\"><strong>Historical Data Web Scraping Solution<\/strong><\/span><\/span><\/strong><\/strong><\/strong><span class=\"ez-toc-section-end\"><\/span><\/h3>\n\n\n\n<p>To address these challenges, DIGI-TEXX developed a sophisticated web scraping solution to automate the process of collecting and processing historical obituary data across a large number of digital public newspaper archives and open-source sites. 
This would enhance the database, providing its users with access to millions of new records.\u00a0<\/p>\n\n\n<style>.kb-image23436_8253b8-20 .kb-image-has-overlay:after{opacity:0.3;}<\/style>\n<div class=\"wp-block-kadence-image kb-image23436_8253b8-20\"><figure class=\"aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"576\" src=\"https:\/\/digi-texx.com\/wp-content\/uploads\/2024\/09\/Web-Scraping-Chart-1024x576.jpg\" alt=\"Online Historical Obituary Data Collection With Web Scraping Solution\" class=\"kb-img wp-image-23468\" title=\"\" srcset=\"https:\/\/digi-texx.com\/wp-content\/uploads\/2024\/09\/Web-Scraping-Chart-1024x576.jpg 1024w, https:\/\/digi-texx.com\/wp-content\/uploads\/2024\/09\/Web-Scraping-Chart-300x169.jpg 300w, https:\/\/digi-texx.com\/wp-content\/uploads\/2024\/09\/Web-Scraping-Chart-768x432.jpg 768w, https:\/\/digi-texx.com\/wp-content\/uploads\/2024\/09\/Web-Scraping-Chart-1536x864.jpg 1536w, https:\/\/digi-texx.com\/wp-content\/uploads\/2024\/09\/Web-Scraping-Chart-2048x1152.jpg 2048w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure><\/div>\n\n\n\n<ol class=\"wp-block-list\">\n<li class=\"has-children\"><span class=\"list-item-text\"><strong><span style=\"color: var(--accent);\" class=\"stk-highlight\">Automated Web Scraper<\/span>: <\/strong>Our solution focuses on the architectural navigation and raw data collection from diverse online sources.\n<\/span><ul class=\"wp-block-list\">\n<li>The engine crawls and indexes various platforms, including digital newspapers, archival websites, and public records, regardless of site structure.<\/li>\n\n\n\n<li>Automatically identify and retrieve files in multiple formats (HTML, PDF, and image files) to be processed in a unified environment<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li class=\"has-children\"><span class=\"list-item-text\"><strong><span style=\"color: var(--accent);\" class=\"stk-highlight\">Natural Language Processing (NLP<\/span>)<\/strong>: 
Obituaries are primarily composed of unstructured free text; we deployed advanced NLP models to read and interpret the narrative content.\n<\/span><ul class=\"wp-block-list\">\n<li>Our solution uses semantic parsing to understand the prose of a life story, accurately distinguishing between the deceased and surviving relatives.<\/li>\n\n\n\n<li>It transforms narrative paragraphs into clean, categorized data fields (e.g., Cause of Death, Occupation, Education), turning biographical stories into searchable, actionable databases.<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li><strong><span style=\"color: var(--accent);\" class=\"stk-highlight\">Data Validation:<\/span> <\/strong>The collected data was cleaned, standardized, and formatted to match the client\u2019s database structure.<\/li>\n<\/ol>\n\n<\/div><\/div>\n\n<div class=\"gb-container gb-container-3c64cdaf\"><div class=\"gb-inside-container\">\n<div class=\"gb-grid-wrapper gb-grid-wrapper-84dc8722\">\n<div class=\"gb-grid-column gb-grid-column-31652cd0\"><div class=\"gb-container gb-container-31652cd0\"><div class=\"gb-inside-container\">\n\n<h2 class=\"gb-headline gb-headline-6c0964bb gb-headline-text\"><span class=\"ez-toc-section\" id=\"BUSINESS_OUTCOME\"><\/span>BUSINESS OUTCOME<span class=\"ez-toc-section-end\"><\/span><\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"has-children\"><span class=\"list-item-text\"><strong><span style=\"color: var(--accent);\" class=\"stk-highlight\">Optimized data processing time:<\/span><\/strong>\n<\/span><ul class=\"wp-block-list\">\n<li><span style=\"color: var(--accent);\" class=\"stk-highlight\">20-30 minutes<\/span> for a text-based URL<\/li>\n\n\n\n<li><span style=\"color: var(--accent);\" class=\"stk-highlight\">2-3 days<\/span> for a more complex URL<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li><strong><span style=\"color: var(--accent);\" class=\"stk-highlight\">Expanded Database<\/span>:<\/strong> Delivered over <span style=\"color: var(--accent);\" 
class=\"stk-highlight\">450,000 records per URL<\/span>, significantly expanding the client&#8217;s database and providing users with access to millions of historical records.<\/li>\n\n\n\n<li><strong><span style=\"color: var(--accent);\" class=\"stk-highlight\">Improved Data Quality And Accuracy<\/span><\/strong>: Achieved an accuracy rate of <span style=\"color: var(--accent);\" class=\"stk-highlight\">95%<\/span>, ensuring reliable and accurate obituary data.<\/li>\n\n\n\n<li><span style=\"color: var(--accent);\" class=\"stk-highlight\"><strong>Power Machine Learning and AI Applications<\/strong><\/span>: Create large datasets for training machine learning models to improve accuracy and performance<\/li>\n<\/ul>\n\n<\/div><\/div><\/div>\n\n<div class=\"gb-grid-column gb-grid-column-0123e88f\"><div class=\"gb-container gb-container-0123e88f\"><div class=\"gb-inside-container\">\n\n<figure class=\"gb-block-image gb-block-image-d804f78c\"><img loading=\"lazy\" decoding=\"async\" width=\"740\" height=\"416\" class=\"gb-image gb-image-d804f78c\" src=\"https:\/\/digi-texx.com\/wp-content\/uploads\/2024\/09\/02.-Web-Scraping.jpg\" alt=\"Unveil Historical Obituary Data With Web Scraping Solution\" title=\"02. Web Scraping\" srcset=\"https:\/\/digi-texx.com\/wp-content\/uploads\/2024\/09\/02.-Web-Scraping.jpg 740w, https:\/\/digi-texx.com\/wp-content\/uploads\/2024\/09\/02.-Web-Scraping-300x169.jpg 300w\" sizes=\"auto, (max-width: 740px) 100vw, 740px\" \/><\/figure>\n\n<\/div><\/div><\/div>\n<\/div>\n<\/div><\/div>","protected":false},"excerpt":{"rendered":"<p>A web scraping solution to automate collecting and processing historical obituary data across public digital newspaper archives and open-source sites. 
<\/p>\n","protected":false},"featured_media":23480,"template":"","industries":[59],"class_list":["post-23436","case-studies","type-case-studies","status-publish","has-post-thumbnail","hentry","industries-historical-archive"],"acf":[],"_links":{"self":[{"href":"https:\/\/digi-texx.com\/ja\/wp-json\/wp\/v2\/case-studies\/23436","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/digi-texx.com\/ja\/wp-json\/wp\/v2\/case-studies"}],"about":[{"href":"https:\/\/digi-texx.com\/ja\/wp-json\/wp\/v2\/types\/case-studies"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/digi-texx.com\/ja\/wp-json\/wp\/v2\/media\/23480"}],"wp:attachment":[{"href":"https:\/\/digi-texx.com\/ja\/wp-json\/wp\/v2\/media?parent=23436"}],"wp:term":[{"taxonomy":"industries","embeddable":true,"href":"https:\/\/digi-texx.com\/ja\/wp-json\/wp\/v2\/industries?post=23436"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}