BUSINESS CHALLENGES
Our Client
DIGI-TEXX’s client is a researcher at one of the top research universities in the heart of Tokyo, Japan. Specializing in environmental health and spatial information science, the client studies the impact of environmental change on humans using machine learning and NLP.
The client has researched the application of machine learning to disease-related social media data, which can be applied to the prediction of pandemic waves.
Project Challenges
Insightful Data Lies in Daily Social Posts
With COVID-19 threatening global health, social media data has drawn growing attention from researchers. X (Twitter) in particular can be used to explore multiple facets of forecasting potential disease spread.
According to the National Library of Medicine, many studies that collected social media search indexes for COVID-19 symptoms have shown that new suspected cases can be forecast 6–9 days, or even 1–2 weeks, ahead of official records.
A 2021 study in Frontiers in Public Health examined digital data streams as early signals of COVID-19 outbreaks in Canada and the US. It found that symptom-related posts from X (Twitter) showed the best prediction performance, predicting 100% of first waves about 2–6 days earlier than other data streams.
Despite the potential advantages of social media for research, our client faced several hurdles. The high volume of data that needed accurate annotation, coupled with tight deadlines, presented a significant challenge.
In addition, posts on the target platform, X (Twitter), are typically short and make heavy use of abbreviations, hashtags, and similar shorthand, which makes contextual information difficult to interpret.
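As an illustration of why short, shorthand-heavy posts need extra handling, the following minimal Python sketch normalizes a tweet before any keyword matching. It is not DIGI-TEXX’s actual tooling: the abbreviation map, the regular expressions, and the function names are all assumptions made for this example.

```python
import re

# Hypothetical abbreviation map; a real project would curate its own list.
ABBREVIATIONS = {
    "w/": "with",
    "b/c": "because",
    "temp": "temperature",
}

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#(\w+)")
# Zero-width position between a lowercase and an uppercase letter.
CAMEL_RE = re.compile(r"(?<=[a-z])(?=[A-Z])")


def split_hashtag(match):
    """Turn '#SoreThroat' into 'Sore Throat' by splitting on case changes."""
    return CAMEL_RE.sub(" ", match.group(1))


def normalize_tweet(text):
    """Strip URLs and mentions, unpack hashtags, expand known abbreviations."""
    text = URL_RE.sub("", text)
    text = MENTION_RE.sub("", text)
    text = HASHTAG_RE.sub(split_hashtag, text)
    # Only whole, whitespace-separated tokens are expanded in this sketch.
    words = [ABBREVIATIONS.get(word.lower(), word) for word in text.split()]
    return " ".join(words).lower()


print(normalize_tweet("Home w/ a high temp again @nhk #SoreThroat https://t.co/abc"))
# prints: home with a high temperature again sore throat
```

Because `\w` matches Unicode word characters in Python 3, a Chinese hashtag such as `#发烧` is unpacked the same way. Abbreviations followed by punctuation (for example `temp.`) are not expanded by this simple token lookup.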
Project Scope
Classify, label, and categorize users’ tweets on X (Twitter) based on predefined criteria: keywords, phrases, and sentiments related to flu-like symptoms.
- Data Volume: The client’s sizable dataset of 200,000 tweets needs to be annotated within 2 months.
- Language: English and Chinese language proficiency is required.
- Ethical Considerations: Adherence to privacy regulations and ethical guidelines.
- Service time: 24/7
SOLUTION
Text Annotation With Natural Language Processing
DIGI-TEXX provided a hybrid, human-in-the-loop text annotation service that combined machine learning, natural language processing (NLP), and a team of highly skilled data annotators with advanced English and Chinese proficiency. This approach ensured efficient annotation of the large dataset.
Text annotation process:
- Data Pre-processing: Classify the data into relevant categories and remove irrelevant data, duplicates, and noisy content.
- Keyword & Sentiment Analysis: Employ NLP techniques to analyze and identify relevant keywords and phrases related to flu-like symptoms. Utilize machine learning models to determine the sentiment associated with the extracted keywords and phrases.
- Data Labeling: Label a subset of the data with one of two categories, “high probability of infection” or “low probability or insufficient information”, so that the labeled data precisely matches the client’s specific needs.
- Data Quality Assurance: Our annotators conducted frequent quality assurance to monitor the accuracy and consistency of the project. In addition, a feedback loop was established to evaluate and enhance performance continuously.
- Data Export and Delivery: Deliver the annotated dataset in a format compatible with the client’s systems for further analysis and research.
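The pre-processing, keyword, and labeling steps above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the production pipeline: the keyword list, the `min_hits` threshold, and the rule that routes single-keyword posts to a human annotator are all hypothetical, and sentiment analysis is left out.

```python
from dataclasses import dataclass

# Hypothetical keyword list (English and Chinese); the client's actual
# predefined criteria are not given in this case study.
SYMPTOM_KEYWORDS = {"fever", "cough", "sore throat", "chills", "发烧", "咳嗽"}

HIGH = "high probability of infection"
LOW = "low probability or insufficient information"


@dataclass
class Annotation:
    text: str
    matched: list
    suggested_label: str
    needs_review: bool


def pre_annotate(tweets, min_hits=2):
    """Deduplicate tweets and suggest a label for human annotators to confirm."""
    seen = set()
    results = []
    for raw in tweets:
        # Collapse whitespace and lowercase so near-identical posts compare equal.
        text = " ".join(raw.split()).lower()
        if not text or text in seen:
            continue  # drop empty posts and exact duplicates
        seen.add(text)
        # Simple substring matching, so "cough" also matches "coughing".
        matched = sorted(k for k in SYMPTOM_KEYWORDS if k in text)
        # Assumed heuristic: several symptom keywords suggest the HIGH category.
        label = HIGH if len(matched) >= min_hits else LOW
        # Assumed routing rule: one keyword alone is ambiguous, so flag it.
        needs_review = len(matched) == 1
        results.append(Annotation(text, matched, label, needs_review))
    return results


for item in pre_annotate([
    "Fever and  cough since Monday",
    "fever and cough since monday",  # duplicate after normalization
    "Slight cough today",
    "Great ramen tonight",
]):
    print(item.suggested_label, item.needs_review, item.matched)
```

In a human-in-the-loop setup, the suggested label is only a starting point: annotators confirm or correct it, and the flagged posts are the ones most worth their attention first.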
BUSINESS OUTCOME
- Accurately annotated 200,000 Chinese posts from X (Twitter).
- Completed the project within 2 months.
- Accuracy rate: 100%
- We provided high-quality annotated data to enhance the client’s AI algorithm accuracy and efficiency.
- The annotated data can be used to develop more accurate and timely early warning systems for future pandemics, allowing for proactive measures to be taken.