lfr4704/c4v

 
 

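To run the front end on macOS for map visualization

  1. In the directory that contains the front-end files, run python -m SimpleHTTPServer 8080 (Python 2) or python3 -m http.server 8080 (Python 3).
  2. Open a web browser to localhost:8080.
  3. Press Control+C in the terminal to stop the server.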

References for visualization

https://marcobonzanini.com/2015/06/16/mining-twitter-data-with-python-and-js-part-7-geolocation-and-interactive-maps/
https://leafletjs.com/examples/geojson/
https://leafletjs.com/reference-1.4.0.html#geojson-coordstolatlng
https://chrisalbon.com/python/other/mine_a_twitter_hashtags_and_words/
https://blog.prototypr.io/interactive-maps-with-python-part-1-aa1563dbe5a9

References for twitter API

https://developer.twitter.com/en/docs/tweets/enrichments/overview

References for Tweepy

https://tweepy.readthedocs.io/en/v3.5.0/getting_started.html

Reference for examples

https://github.com/chrisalbon/twitter_miner

Other references

https://jsoneditoronline.org/
https://stackoverflow.com/questions/35983078/leaflet-swap-the-coordinates-when-use-geojson

Using Twitter Data to Help Public Health

Challenge

The ultimate goal of this challenge is to create predictive models using Twitter data to:

  1. Detect epidemic outbreaks.
  2. Identify where a particular medicine is needed in the country.

NOTE: We don't expect this challenge to be completed during the Hackathon, the goal is to set the foundational work that will allow us to solve this problem.

Background

Who we are

About Dr. Julio Castro

Dr. Julio Castro is a Venezuelan doctor and professor specializing in infectious diseases. He is one of the most important activists focusing on the public health sector. Since 2014, when official data was no longer being published, Dr. Castro and his team have gathered health statistics using a digital tool that allows them to monitor hospitals across the country.

His work has been widely covered in both national and international media.

About Medicos por la Salud

Medicos por la Salud is a network of doctors distributed across the country that has been monitoring hospitals and publishing a yearly report (Encuesta Nacional de Hospitales) that tries to quantify and communicate the magnitude of the public health crisis in Venezuela.

Problem Statement

Venezuela has seen an unprecedented increase in cases of malaria, measles, diphtheria and zika.

References

Infectious Diseases Spike amid Venezuela's Political Turmoil

Venezuela crisis threatens disease epidemic across continent - experts

On top of this, medicine scarcity is one of the most severe problems that Venezuela faces today. Hospitals can go without essential medications like adrenaline, insulin, and anesthetics for months at a time.

Dr. Julio Castro and his team have been conducting surveys in hospitals nationwide to keep track of where specific medications are most needed. They have also been collecting a database of more than 1 million tweets that contain medical supply requests, emergency data, and other indicators.

They believe this data, combined, may be used for the following:

  1. Making informed decisions on where to deploy limited quantities of specific medications brought as humanitarian aid.
  2. Detecting disease outbreaks in real-time. These reports would otherwise take weeks to go through the official channels, making it really hard to react to the outbreaks, which costs many lives in the process.

Challenge

Dr. Julio Castro has shared the Twitter data they have been collecting. Code For Venezuela has done an initial analysis of this data. Below you can find the results of that study to give you some context about the problem.

Data Summary

We have slightly more than 1M tweets containing hashtags related to #ServicioPublico, a popular hashtag used in Venezuela whenever someone is searching for a specific medicine or medical treatment. It connects people who need a specific medicine with people in other parts of the country who might have access to it.

| Field Name | Field Description |
| --- | --- |
| tweet_date | Time when the tweet was created, in the Pacific time zone. |
| tweet_text | Content of the tweet. |
| tweet_url | Link to this tweet on twitter.com. |
| hash_tags | List of hashtags present in the tweet (computed by us by parsing the tweet). |
| raw_tweet | Raw HTML captured when the tweet was originally ingested. |

Raw data is available here.

Data Limitations

We found the following issues with the data:

  • Data might be incomplete. According to the Twitter API descriptions, we believe that the data ingested by Dr. Castro might be only a sample of the data from the hashtag specified above. This is because you need access to the Enterprise or Premium APIs to get all tweets matching a specific query instead of a sample of them.
  • There are duplicates in this data, meaning that popular tweets will be over-represented.
  • We don't have any labeled data that we could use to correlate these tweets with, for example, disease outbreaks.

These are problems that different teams could address during the hackathon.

We are exploring ways to get access to the Twitter premium APIs, but this is currently a work in progress. If we do get access, participants could leverage the Twitter Historical API and the metadata that comes with it to work on this problem.

Proposed Challenges

Due to the issues highlighted above, we are proposing the following set of challenges that use and are inspired by this data:

1. Data Ingestion Pipeline

Create a data pipeline that keeps ingesting these tweets and that can potentially use Twitter's premium APIs to maintain an up-to-date stream of #ServicioPublico tweets.

This pipeline should provide a reliable way of ingesting Twitter data for #ServicioPublico and ideally should make it easy to detect any issues that arise during ingestion.
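As a rough illustration of what "detecting issues during ingestion" could mean, here is a minimal sketch of a single ingestion step. It is not tied to any particular Twitter client: `fetch` is a hypothetical callable that returns tweets as dictionaries using the field names from the Data Summary, and `store` is a plain dictionary standing in for whatever storage the pipeline ends up using.

```python
import logging
from typing import Callable, Dict, Iterable

logger = logging.getLogger("ingest")


def ingest_batch(fetch: Callable[[], Iterable[Dict]], store: Dict[str, Dict]) -> Dict[str, int]:
    """Fetch one batch and add unseen tweets to `store`, keyed by tweet_url.

    `fetch` and `store` are placeholders (assumptions), not real API objects.
    Returns counters that a scheduler or monitoring job could alert on.
    """
    stats = {"fetched": 0, "new": 0, "malformed": 0, "errors": 0}
    try:
        batch = list(fetch())
    except Exception:
        # Record the failure instead of crashing, so the next run can retry.
        logger.exception("fetch failed")
        stats["errors"] += 1
        return stats

    for tweet in batch:
        stats["fetched"] += 1
        url = tweet.get("tweet_url")
        if not url or not tweet.get("tweet_text"):
            # Missing required fields: count it so data-quality drops are visible.
            stats["malformed"] += 1
            continue
        if url not in store:
            store[url] = tweet
            stats["new"] += 1
    return stats
```

A scheduler could call `ingest_batch` periodically and raise an alert when `errors` is non-zero or when `new` stays at zero for an unusually long time.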

2. Data set enrichment

The data in its current state does not have enough information to create predictive models so we need to extend it and analyze it. Here are some ideas in this direction:

  • Tweet De-duplication: as a first step, how do we remove duplicated tweets so that later analyses are not skewed by a few popular tweets?
  • Change Point Detection: use Change Point Detection to determine times when these tweets suddenly become more common.
  • NLP analysis: Build a tool/pipeline that, given the data from Twitter, can understand whether a specific tweet is requesting a specific medicine so that identical medicines can be grouped together. This would require using NLP on tweets in Spanish.
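As a sketch of the de-duplication idea in the list above, one simple option is to normalize each tweet's text and keep only the first tweet per normalized form. The normalization rules here (dropping a leading "RT @user:", URLs, extra whitespace, and case) are assumptions and would need checking against the real data.

```python
import re
from typing import Dict, Iterable, List

_RT_PREFIX = re.compile(r"^\s*rt\s+@\w+:?\s*", re.IGNORECASE)
_URL = re.compile(r"https?://\S+")
_WHITESPACE = re.compile(r"\s+")


def normalize(text: str) -> str:
    """Reduce a tweet to a comparison key: no RT prefix, no URLs, one case."""
    text = _RT_PREFIX.sub("", text)
    text = _URL.sub("", text)
    return _WHITESPACE.sub(" ", text).strip().casefold()


def deduplicate(tweets: Iterable[Dict]) -> List[Dict]:
    """Keep the first tweet seen for each normalized tweet_text."""
    seen = set()
    unique = []
    for tweet in tweets:
        key = normalize(tweet["tweet_text"])
        if key not in seen:
            seen.add(key)
            unique.append(tweet)
    return unique
```

This only catches exact matches after normalization. Near-duplicates (for example, a retweet with an added comment) would need a fuzzier comparison.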

A possible outcome of NLP could be providing descriptive statistics of medicine requests vs others (beds, materials, electricity).
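A very rough starting point for such descriptive statistics, well short of real NLP, is keyword matching. The sketch below assigns each tweet to the first category whose keywords appear in it. The Spanish keyword lists are illustrative assumptions, not a vetted vocabulary.

```python
import re
import unicodedata
from collections import Counter
from typing import Iterable

# Illustrative keyword lists (assumptions); a real list would come from domain experts.
CATEGORY_KEYWORDS = {
    "medicine": {"insulina", "adrenalina", "anestesia", "medicamento", "medicina"},
    "beds": {"cama", "camas"},
    "materials": {"guantes", "jeringas", "gasas"},
    "electricity": {"electricidad", "luz"},
}


def strip_accents(text: str) -> str:
    """Remove combining marks so accented and unaccented spellings compare equal."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def categorize(text: str) -> str:
    """Return the first category with a keyword in the tweet, else 'other'."""
    tokens = set(re.findall(r"\w+", strip_accents(text).lower()))
    for category, keywords in CATEGORY_KEYWORDS.items():
        if tokens & keywords:
            return category
    return "other"


def category_counts(texts: Iterable[str]) -> Counter:
    """Count tweets per category."""
    return Counter(categorize(t) for t in texts)
```

Counting categories this way gives a baseline that a trained Spanish-language model could later be compared against.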

  • Geolocate the Tweet: At the moment, tweets do not have geolocation data. Can you find ways to get this information (e.g., querying the Twitter API, inferring location from places or users mentioned in the tweet, etc.)?
  • Medicine to disease mapping: Once we get information about medicines in a tweet, we will need to create a training data set that maps those tweets to the diseases that are treated with those medicines.
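One way to infer a location, sketched below, is to look for known place names in the tweet text and emit a GeoJSON feature that a map front end can plot. The gazetteer is a tiny hypothetical sample with approximate coordinates. Note that GeoJSON positions are ordered longitude first, then latitude, which is the opposite of the usual "lat, lon" convention.

```python
import re
import unicodedata
from typing import Dict, Optional

# Hypothetical sample gazetteer: place name -> approximate (latitude, longitude).
GAZETTEER = {
    "caracas": (10.48, -66.90),
    "maracaibo": (10.65, -71.63),
    "valencia": (10.16, -68.00),
}


def strip_accents(text: str) -> str:
    """Remove combining marks so accented spellings match gazetteer keys."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def locate(text: str) -> Optional[str]:
    """Return the first gazetteer place mentioned in the text, if any."""
    for token in re.findall(r"\w+", strip_accents(text).lower()):
        if token in GAZETTEER:
            return token
    return None


def to_geojson_feature(tweet: Dict) -> Optional[Dict]:
    """Build a GeoJSON Point feature for a tweet, or None if no place is found."""
    place = locate(tweet["tweet_text"])
    if place is None:
        return None
    lat, lon = GAZETTEER[place]
    return {
        "type": "Feature",
        # GeoJSON order is [longitude, latitude].
        "geometry": {"type": "Point", "coordinates": [lon, lat]},
        "properties": {"place": place, "tweet_url": tweet.get("tweet_url")},
    }
```

Single-token matching misses multi-word place names and cannot tell which of several same-named places is meant, so this is only a baseline.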

3. Data Visualization

Visualize the data given to see whether some patterns emerge over time. One motivating question would be: which medicines are requested more often during specific times?

Visualization can aid in detecting points in time where the number of tweets requesting medicine and/or medical equipment has changed significantly.
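Before plotting, it can help to flag candidate dates automatically. The sketch below builds a daily count series and flags days whose count is far above the preceding week's average. It is a simple rolling z-score rule, not a full change-point detection algorithm. The window, threshold, and the floor on the standard deviation are arbitrary assumptions, and it assumes tweet_date has already been parsed into `datetime.date` values.

```python
from collections import Counter
from datetime import date, timedelta
from statistics import mean, pstdev
from typing import Iterable, List, Tuple


def daily_counts(dates: Iterable[date]) -> List[Tuple[date, int]]:
    """Return (day, count) for every day from the first to the last tweet."""
    counts = Counter(dates)
    if not counts:
        return []
    day, last = min(counts), max(counts)
    series = []
    while day <= last:
        series.append((day, counts.get(day, 0)))  # days with no tweets count as 0
        day += timedelta(days=1)
    return series


def flag_spikes(series: List[Tuple[date, int]], window: int = 7,
                threshold: float = 3.0) -> List[date]:
    """Flag days whose count exceeds the trailing mean by `threshold` deviations."""
    flagged = []
    for i in range(window, len(series)):
        history = [count for _, count in series[i - window:i]]
        mu = mean(history)
        # Floor the deviation at 1.0 so a perfectly flat history does not
        # turn every tiny increase into a spike.
        sigma = max(pstdev(history), 1.0)
        day, count = series[i]
        if count > mu + threshold * sigma:
            flagged.append(day)
    return flagged
```

The flagged dates could be drawn as markers on a time-series chart so that reviewers can check each one against the underlying tweets.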

4. Predictive Models

Once the data is curated and traditional ML techniques can be applied to it, the ultimate goal of this project is to create predictive models that use Twitter data to:

  1. Detect epidemic outbreaks.
  2. Identify where a particular medicine is needed in the country.

NOTE: Projects 1, 2, 3 are orthogonal and you don't need to solve them all. They could be split between different teams in the Hackathon. Project 4 is aspirational and depends on 2.

Skills Required

  1. Data engineering
  2. Google Cloud storage
  3. Column Data Store
  4. Airflow
  5. Machine Learning
  6. NLP

Project contact

Dr. Julio Castro, @juliocastrom
