I’ve long been using Tweepy, a great Python library that lets you connect to the Twitter API and consume services such as search and streaming. If you’ve used this library before, you probably know that it quickly hits limitations that compromise your data collection efforts.
If you visit the Twitter developer portal, you can learn about these rate limits, which apply to both POST and GET requests. What bothered me most in the past were the 3,200-tweet limit per user and the 7-day history limit on every search. But those days are over, because there’s a new player in town.
Twint is a scraping tool developed in Python to extract tweets from specific users, as well as tweets on specific topics, hashtags, geographic locations, languages, etc. Twint also makes it easy to scrape a user’s followers, likes, and retweets.
Above all, Twint has these major benefits:
- No rate limits. It can fetch almost all the tweets. When I was interested in collecting tweets about Covid-19, I managed to gather around 1.5 million tweets spanning a large period of time (roughly three to five months), and it took only a couple of hours
- Twint doesn’t need any prior setup. Unlike using Tweepy, you don’t need to create a Twitter application (and wait for Twitter to approve it)
- Twint can be used anonymously since it doesn’t ask you to connect to your account or enter your API credentials
First things first: install it by cloning the repo and using the master branch. As of now, installing Twint with pip results in issues when running searches.
```shell
git clone --depth=1 https://github.com/twintproject/twint.git
cd twint
pip3 install . -r requirements.txt
```
There are two ways you can use Twint:
Twint can be called from the terminal with different options:
The most important ones (at least, to me, when I used it) are
- `u` or `username`: the username of the account you’re scraping
- `s` or `search`: the search keyword or phrase you’re tracking
- `l` or `lang`: the language of the tweets (“en” for English, “fr” for French, etc.)
- `json`: export the results in JSON format
- `csv`: export the results in CSV format
- `limit`: the number of tweets to pull
The documentation is self-explanatory: you can compose these filters and get creative 💡.
Here are some examples I ended up running while collecting tweets about Covid-19 and vaccines.
I was interested in French tweets discussing the Covid-19 vaccines, and more precisely, tweets posted in Paris. The lang and geo options were very helpful in this situation.
```shell
twint -s "covid vaccin" --lang fr --json --output data/tweets.json
twint -s "covid vaccin" -g="48.880048,2.385939,5km" --json
```
The `-g="48.880048,2.385939,5km"` option filters for tweets posted within a 5 km radius around Paris.
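To make that radius filter concrete, here’s a small, self-contained sketch (not Twint code) of how a 5 km check around those coordinates works, using the haversine great-circle distance; the Place de la République coordinates are just an illustrative point:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between two (lat, lon) points, in kilometres.
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

center = (48.880048, 2.385939)      # the point used in the command above
republique = (48.8673, 2.3636)      # Place de la République, for illustration

print(haversine_km(*center, *republique) <= 5.0)  # True
```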
Twint can also be called from a Python script. You simply define the arguments you would pass in the terminal as attributes of a configuration object.
Here’s how it’s done:
```python
import twint

# Configure
c = twint.Config()
c.Search = "covid vaccin"
c.Lang = "fr"
c.Geo = "48.880048,2.385939,5km"
c.Limit = 300
c.Output = "./test.json"
c.Store_json = True

# Run
twint.run.Search(c)
```
Pretty simple, right?
Once you launch the script, you’ll see the pulled tweets scrolling on the screen.
When the script finishes after a few seconds, you can inspect the downloaded file by loading it into a pandas dataframe, then check its shape and the different columns (or metadata) the scraper pulled.
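A minimal sketch of that loading step: Twint’s JSON export is newline-delimited, so pandas reads it with `lines=True`. The two inline records below stand in for the real `./test.json` file produced by the scraper:

```python
import io
import pandas as pd

# Stand-in for the file written by the scraper (one JSON object per line).
sample = io.StringIO(
    '{"id": 1, "tweet": "premier tweet", "language": "fr"}\n'
    '{"id": 2, "tweet": "second tweet", "language": "fr"}\n'
)

# With a real run you would pass the path: pd.read_json("./test.json", lines=True)
df = pd.read_json(sample, lines=True)

print(df.shape)           # (2, 3)
print(list(df.columns))   # ['id', 'tweet', 'language']
```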
We got 300 tweets and that’s what we asked for.
By inspecting the data, we also notice that we get not only the text of each tweet but additional metadata as well, such as the user_id, the language of the tweet, the creation date, etc.
We would normally get this same information from the API although we’re not using it right here. Amazing, right?
Like any other data science project, success goes along with a great workflow. Here’s what I usually do. Feel free to share your best practices in the comments:
If you’re following news or trends in a particular market, Twint can be very helpful.
By properly defining the search parameters (keywords, hashtags, date range, language, and geographic scope), you can set it up in no time to meet your requirements.
You can schedule Twint to run periodically (every day or week, for example) and update a database. You can either create this database manually or use the elasticsearch option, which creates a dedicated Elasticsearch index for your data.
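One simple way to schedule such a run on Linux is a crontab entry; the schedule, path, and query below are hypothetical examples, not prescribed values:

```shell
# Hypothetical crontab entry: run the search every day at 06:00 and
# append results to a dated JSON file (the project path is an example).
# Note: % must be escaped as \% inside crontab lines.
0 6 * * * cd /home/me/twint-project && twint -s "covid vaccin" --lang fr --json -o data/tweets-$(date +\%F).json
```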
Creating an Elasticsearch database is a very efficient way to store text data, since it makes full-text queries run very fast.
This step is usually performed at the same time as the previous one.
Extracting raw data from Twitter without any transformation rarely yields actionable results. Here’s what I recommend looking into to build additional insights.
- Perform text classification on the tweets: this can be sentiment analysis, for example, which is, quite frankly, the first task anyone thinks of when scraping tweets. But you don’t have to limit yourself to that. You can train your own classifier (fake news detector, topic detector, sarcasm identifier) and apply it to your tweets
- Apply topic extraction techniques: start with LDA or NMF as baselines. Explore document embeddings and clustering methods in a second phase
- Extract named entities from the tweets: think of people’s names, organizations, or locations. It’s interesting to correlate these entities with the sentiment of the tweets or their topics
- Translate the tweets: this can be helpful as a preprocessing step
- Spot the accounts which are tweeting the most. These can introduce a bias in your data. Process them separately or ignore them from your analysis. Typical examples of those accounts could be corporate accounts that share PR communication or ads
- Extract the geographic location of the tweets and cluster it
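The idea of spotting over-active accounts from the list above can be sketched in a few lines of pure Python. The sample records and the 40% threshold are hypothetical; the `username` field mirrors what Twint stores for each tweet:

```python
from collections import Counter

# Hypothetical sample; in practice these would be the records
# loaded from Twint's JSON export.
tweets = [
    {"username": "brand_pr", "tweet": "Our new campaign..."},
    {"username": "alice", "tweet": "Got my vaccine today"},
    {"username": "brand_pr", "tweet": "Press release..."},
    {"username": "bob", "tweet": "Long queues in Paris"},
    {"username": "brand_pr", "tweet": "Ad: ..."},
]

counts = Counter(t["username"] for t in tweets)

# Flag accounts contributing more than 40% of the sample as
# potential bias sources (the threshold is an arbitrary example).
threshold = 0.4 * len(tweets)
noisy_accounts = [user for user, n in counts.items() if n > threshold]
print(noisy_accounts)  # ['brand_pr']
```

Accounts flagged this way can then be processed separately or dropped before training any classifier on the data.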
The natural step that comes after processing and storing the data is visualization.
If you’re using Elasticsearch as a database, you couldn’t be any luckier: Elasticsearch seamlessly integrates with a visualization tool called Kibana that allows you to easily build many charts.
After building a scraping and data processing pipeline and setting up Kibana to visualize the results, your project can now run in a semi-autonomous fashion. I say semi-autonomous because you still need to monitor the results, such as the performance of your classifiers or topic detectors: these artefacts usually need retraining over time.
Twint is a great package to build social media monitoring applications without being blocked by the Twitter API and its rate limits.
Coupled with natural language processing techniques and visualization, Twint can also be a solution to your next data science project. Why not give it a shot then? 😉