Data Annotation and Labeling Social Media Data To Predict The Pandemic

DIGI-TEXX provided a hybrid text annotation process with human-in-the-loop, which combined the power of machine learning, natural language processing (NLP), and a team of highly skilled data annotators with advanced English and Chinese proficiency.

SERVICE OFFERS: Data Annotation

BUSINESS CHALLENGES

Our Client

DIGI-TEXX’s client is a professional from one of the top research universities in the heart of Tokyo, Japan. Specializing in environmental health and spatial information science, the client uses machine learning and NLP to study the impact of environmental change on humans.

The client has researched how machine learning can be applied to social media data on disease-related topics in order to predict pandemic waves.


Project Challenges

Insightful Data Lies in The Daily Social Post

Since COVID-19 emerged as a threat to global health, social media data has attracted growing attention from researchers. X (Twitter) in particular can be used to explore multiple facets of forecasting potential disease spread.


According to the National Library of Medicine, many studies that collected social media search indexes for COVID-19 symptoms have shown that new suspected cases can be forecast 6–9 days, or even up to 1–2 weeks, earlier than official records.

Another study, published in Frontiers in Public Health in 2021, examined digital data streams as early signals of COVID-19 outbreaks in Canada and the US. It found that symptom-related posts from X (Twitter) showed the best prediction performance, predicting 100% of first waves about 2–6 days earlier than other data streams.

Despite the potential advantages of social media for research, our client faced several hurdles. The high volume of data that needed accurate annotation, coupled with tight deadlines, presented a significant challenge.

In addition, data from the target platform, X (Twitter), typically consists of short texts with frequent abbreviations, hashtags, and similar shorthand, which makes contextual information difficult to interpret.
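To illustrate why short, informal posts are hard to process, here is a minimal sketch of the kind of text normalization that often precedes annotation. It is not the project's actual pipeline: the `ABBREVIATIONS` map, the function names, and the choice to drop URLs and mentions are all assumptions made for the example.

```python
import re

# Hypothetical abbreviation map; a real project would curate its own list.
ABBREVIATIONS = {"w/": "with", "temp": "temperature", "idk": "i do not know"}

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#(\w+)")
# Zero-width position between a lowercase and an uppercase letter.
CAMEL_RE = re.compile(r"(?<=[a-z])(?=[A-Z])")


def split_hashtag(match):
    """Turn a hashtag such as '#SoreThroat' into the words 'Sore Throat'."""
    return CAMEL_RE.sub(" ", match.group(1))


def normalize_tweet(text):
    """Strip URLs and mentions, unpack hashtags, and expand abbreviations."""
    text = URL_RE.sub("", text)
    text = MENTION_RE.sub("", text)
    text = HASHTAG_RE.sub(split_hashtag, text)
    # Lowercase and split on whitespace, then expand known abbreviations.
    tokens = [ABBREVIATIONS.get(tok, tok) for tok in text.lower().split()]
    return " ".join(tokens)


print(normalize_tweet("@bob feeling sick w/ a high temp #SoreThroat https://t.co/abc"))
```

Whitespace splitting works for English but not for Chinese, which has no spaces between words, so a Chinese pipeline would need a dedicated word segmenter at that step. This is one reason language-proficient human annotators remain important for this kind of data.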

Project Scope

Classify, label, and categorize users’ tweets on X (Twitter) based on predefined criteria: keywords, phrases, and sentiments related to flu-like symptoms. 

  • Data Volume: The client’s sizable dataset, including 200,000 tweets, needs to be annotated within 2 months.
  • Language: English and Chinese language proficiency is required.
  • Ethical Considerations: Adherence to privacy regulations and ethical guidelines. 
  • Service time: 24/7
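As a rough illustration of what classifying by predefined keywords and phrases can look like, the sketch below matches tweets against a small bilingual symptom lexicon. The lexicon contents and the `match_symptoms` function are hypothetical; the client's actual criteria are not described here.

```python
# Hypothetical symptom lexicon in English and Chinese (illustration only).
SYMPTOM_TERMS = {
    "fever": ["fever", "high temperature", "发烧"],
    "cough": ["cough", "咳嗽"],
    "sore_throat": ["sore throat", "喉咙痛"],
}


def match_symptoms(text):
    """Return the sorted names of symptoms whose terms appear in the text."""
    lowered = text.lower()
    return sorted(
        symptom
        for symptom, terms in SYMPTOM_TERMS.items()
        if any(term in lowered for term in terms)
    )


print(match_symptoms("Woke up with a fever and a sore throat"))
print(match_symptoms("今天发烧又咳嗽"))
```

Plain substring matching like this ignores negation ("no fever today") and figurative use ("football fever"), which is why keyword hits are usually treated as candidates for review rather than as final labels.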

SOLUTION

Text Annotation With Natural Language Processing

DIGI-TEXX provided a hybrid text annotation service with human-in-the-loop, which combined the power of machine learning, natural language processing (NLP), and a team of highly skilled data annotators with advanced English and Chinese proficiency. This approach optimized output for the project, ensuring efficient annotation of the large dataset.
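One common way to organize a human-in-the-loop workflow is to let a model pre-label every item and send only low-confidence items to human annotators. The sketch below shows that idea under stated assumptions: the `Prediction` type, the `route_predictions` function, and the 0.9 threshold are hypothetical, and this case study does not specify how work was actually divided between the model and the annotators.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    tweet_id: str
    label: str
    confidence: float  # model probability for the predicted label, 0.0 to 1.0


def route_predictions(predictions, threshold=0.9):
    """Split predictions into auto-accepted items and items for human review."""
    auto_accepted, needs_review = [], []
    for prediction in predictions:
        if prediction.confidence >= threshold:
            auto_accepted.append(prediction)
        else:
            needs_review.append(prediction)
    return auto_accepted, needs_review


predictions = [
    Prediction("1", "high probability of infection", 0.97),
    Prediction("2", "low probability or insufficient information", 0.55),
]
auto_accepted, needs_review = route_predictions(predictions)
print(len(auto_accepted), "auto-accepted;", len(needs_review), "sent to annotators")
```

Raising the threshold sends more items to annotators and lowers the risk of model errors slipping through; lowering it does the opposite.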

Text annotation process:

  1. Data Pre-processing: Sort the data into relevant categories and remove irrelevant data, duplicates, and noisy content.
  2. Keyword & Sentiment Analysis: Employ NLP techniques to analyze and identify relevant keywords and phrases related to flu-like symptoms. Utilize machine learning models to determine the sentiment associated with the extracted keywords and phrases.
  3. Data Labeling: Label a subset of the data with one of two categories, “high probability of infection” or “low probability or insufficient information”, to provide precise data tailored to the client’s needs.
  4. Data Quality Assurance: Our annotators conducted frequent quality checks to monitor the accuracy and consistency of the annotations. In addition, a feedback loop was established to evaluate and enhance performance continuously.
  5. Export and provide the data: Deliver the annotated dataset in a format compatible with the client’s systems, ready for further analysis and research.
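Two of the steps above, duplicate removal (step 1) and consistency monitoring (step 4), can be sketched in a few lines. This is an illustration under assumptions, not the project's tooling: the function names are hypothetical, and Cohen's kappa is only one widely used measure of agreement between two annotators. The process described here does not say which metric was used.

```python
from collections import Counter


def deduplicate(tweets):
    """Drop repeats, ignoring case and extra whitespace; keep first occurrences."""
    seen = set()
    unique = []
    for tweet in tweets:
        key = " ".join(tweet.lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(tweet)
    return unique


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected by chance from each annotator's label frequencies.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


annotator_1 = ["high", "high", "low", "low"]
annotator_2 = ["high", "low", "low", "low"]
print(cohens_kappa(annotator_1, annotator_2))
```

A kappa close to 1 indicates strong agreement, while a value near 0 means the annotators agree no more often than chance. In a feedback loop, low values would typically trigger a review of the labeling guidelines.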

BUSINESS OUTCOME

  • Accurately annotated 200,000 Chinese posts from the X (Twitter) platform.
  • Completed the project within 2 months.
  • Accuracy rate: 100%
  • We provided high-quality annotated data to enhance the client’s AI algorithm accuracy and efficiency.
  • The annotated data can be used to develop more accurate and timely early warning systems for future pandemics, allowing for proactive measures to be taken.
