Extracting Twitter Data Using Python

Twitter is not just an extended source of news; it is also one of the best samples of the world’s thoughts. With more than 330 million active users, it is one of the top platforms where people like to share what they think. Twitter data can be used for a variety of purposes, such as research, consumer insights, demographic insights and many more.

Hence, the primary aim of this tutorial is to teach you how to get a sample of tweets that are relevant for your project or business.

Before proceeding, make sure you have all of these variables handy:

  1. Consumer Key
  2. Consumer Secret
  3. Access Token
  4. Access Token Secret

If you want to know how to acquire the above-mentioned details, read the blog post written by my colleague Dattatray Upase.

Now let’s do some coding!

Defining the input variables

First, you have to define the global variables that the program needs.

I am importing ‘sys’ to read command line arguments, because I might want to change the keyword, the start date or the end date between runs. For the language I picked English, but you might want to check which other languages are supported. The results will be stored in ‘data’ at the end.

A typical invocation of the script then looks like this:

python script.py start_date end_date keyword
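
The input handling described above can be sketched as follows. This is a minimal illustration, not the original script: the parse_args helper, the LANG constant and the example dates are assumptions made for this sketch.

```python
import sys

LANG = "en"  # language of the tweets to collect
data = []    # collected tweets are stored here at the end


def parse_args(argv):
    """Return (start_date, end_date, keyword) from a list shaped like sys.argv."""
    if len(argv) != 4:
        raise SystemExit("usage: python script.py start_date end_date keyword")
    start_date, end_date, keyword = argv[1:4]
    return start_date, end_date, keyword


# In the real script you would pass sys.argv; a fixed list (with made-up
# dates) is used here so that the sketch runs on its own.
start_date, end_date, keyword = parse_args(
    ["script.py", "2018-05-01", "2018-05-07", "facebook"]
)
print(start_date, end_date, keyword)
```

Keeping the parsing in one small function makes it easy to change the order or number of arguments later.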

Accessing the Twitter API

Twitter allows the usage of its API via an oauth2 authorization framework. The documentation on the Twitter website explains how tweets can be filtered. A short piece of code is enough to implement the authorized request.

The req_count variable counts the number of times I have called the API during the execution of my program. With the code as given, I ran into the following error:

TypeError: Unicode-objects must be encoded before hashing

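To avoid this, I changed post_body="" to post_body=b"", which fixes the problem. On Python 3, a text string has to be encoded to bytes before it can be hashed, and b"" is already a bytes object.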
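
To see why the change works, here is a minimal illustration using Python’s standard hashlib module. It is not part of the script: the body_hash helper is hypothetical, and the sketch assumes that the error comes from the request body being hashed while the request is signed.

```python
import hashlib


def body_hash(post_body):
    """Hash a request body the way a signing step might (hypothetical helper)."""
    return hashlib.sha1(post_body).hexdigest()


# A bytes body can be hashed directly.
print(body_hash(b""))

# A text (str) body raises TypeError on Python 3.
try:
    body_hash("")
except TypeError as err:
    print("TypeError:", err)
```

The exact wording of the error message differs between Python versions, but the exception type is always TypeError.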

Twitter API Usage and Reference

It’s time to set the API URL to get the data. I am using the parameter min_faves. Here’s the explanation of the URL and some optimization tricks:

‘min_faves’ is used to set the minimum number of favorites a tweet should have in the data. It’s a very useful feature but it’s not mentioned in the Twitter API documentation.

‘q’ represents the query, i.e. the keywords you want to search for. Here it’s important to give as few keywords as possible. For example, let’s imagine that I want tweets about Facebook and Google. If I give both as keywords, say, FACEBOOK and GOOGLE, the query will return at most 100 tweets, since that’s the limit per request. But if I run the query twice, once with Facebook and once with Google, I can get a total of 200 tweets. Long story short, it’s better to use one keyword per query.

‘lang’ represents the language of the filtered tweets. Since I want to get tweets in English, I am setting it to ‘en’.

‘since’ is the start date of the period in which you want to look for tweets. This start date should fall within the last 7 days. This is another feature which is not documented in the Twitter API documentation.

‘until’ represents the end date of your desired period. Logically, it should also fall within the last 7 days. It’s also not documented in the Twitter API documentation.

‘result_type’ represents the kind of tweets you want. It has 3 values:

‘recent’ gives the most recent tweets, i.e. the tweets at the end of the selected period.

‘popular’ gives the most popular tweets and hence it misses a lot of tweets. You would always get the tweets with the top faves and retweets. The min_faves feature would be of no use here.

‘mixed’ gives a mix of recent and popular tweets.

‘count’ represents the maximum number of tweets in the result. Default is set to 15 and the maximum is 100.

With the mixed result_type and min_faves, we can collect the maximum number of tweets by running the query multiple times.

For more documented features, you can also check out Twitter’s API documentation.
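
The parameters above can be assembled into a request URL roughly like this. It is a sketch under assumptions, not the original code: the build_search_url helper is hypothetical, the example dates are made up, and passing min_faves as a min_faves:N operator inside ‘q’ is my assumption about how it is sent.

```python
from urllib.parse import urlencode

# Standard search endpoint of Twitter API v1.1.
SEARCH_URL = "https://api.twitter.com/1.1/search/tweets.json"


def build_search_url(keyword, start_date, end_date, min_faves,
                     lang="en", result_type="mixed", count=100):
    """Build the search URL for one keyword and one min_faves value."""
    params = {
        # Assumption: min_faves travels inside the query string itself.
        "q": f"{keyword} min_faves:{min_faves}",
        "lang": lang,
        "since": start_date,
        "until": end_date,
        "result_type": result_type,
        "count": count,  # 100 is the maximum per request
    }
    return SEARCH_URL + "?" + urlencode(params)


print(build_search_url("facebook", "2018-05-01", "2018-05-07", 60000))
```

Using urlencode instead of string concatenation takes care of escaping spaces and colons in the query.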

Saving/Autosaving the Retrieved Tweet Data

As a next step, you need to define an autosave/save method that takes a parameter ‘saveOverride’. This parameter is there to bypass the autosave time restriction and save the file immediately. For the time restriction itself, I create a variable t_last that initially holds the start time of the program. Inside the method, I check whether 5 minutes have passed since t_last (the last save time). If more than 5 minutes have passed, I set ‘saveStatus’ to True.

Next, I check for ‘saveOverride’, which simply means that I need to give instruction to my program that no matter what, the file should be saved now. For this, I set the saveStatus to be True.

Then if ‘saveStatus’ is True, the script will change the t_last to the current time. After that the code creates a dictionary object and prints “Autosave at [time]” so that you know that the data is being autosaved.

Next, I check whether the output file already exists. If it does, I combine the new data with the data from the already saved file and write the result back to the same file. If it doesn’t exist, I create a new file and write the data to it.
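
Put together, the save logic could look like the sketch below. It follows the steps described above but is not the original code: the save_state dictionary, the path parameter, the return value and the "statuses" key of the saved dictionary are assumptions made for illustration.

```python
import json
import os
import time

AUTOSAVE_INTERVAL = 5 * 60               # seconds between automatic saves
save_state = {"t_last": time.time()}     # last save time, initially the start time


def autosave(data, path, saveOverride=False):
    """Save `data` to `path` if 5 minutes have passed or saveOverride is set.

    Returns True if the file was written, False otherwise.
    """
    saveStatus = (time.time() - save_state["t_last"]) > AUTOSAVE_INTERVAL
    if saveOverride:
        saveStatus = True  # save now, no matter what
    if not saveStatus:
        return False

    save_state["t_last"] = time.time()
    print("Autosave at", time.ctime(save_state["t_last"]))

    combined = {"statuses": list(data)}
    if os.path.exists(path):
        # Combine the already saved data with the new data.
        with open(path) as f:
            combined["statuses"] = json.load(f)["statuses"] + combined["statuses"]
    with open(path, "w") as f:
        json.dump(combined, f)
    return True
```

Because the new data is appended to what is already in the file, the caller should empty its list after a successful save; otherwise the same tweets would be written again on the next save.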

As a result, I have coded almost the entire functionality I wanted.

Optimizing Further

Now it’s time to make use of these functions. I am writing a while(1) loop, which is equivalent to while(True). The loop keeps running until a ‘break’ statement is reached or an exception is raised.

First, I collect the tweet data in a dictionary ‘d’ using json.loads, which converts the JSON response into a Python dictionary. Then I wrap the block of code that extracts the statuses from the data in a try/except. I do this because the Twitter API sometimes returns not data but a JSON message describing an error, and I don’t want my program to stop in such cases. I also want it to report at which request number that happened and to save my data using the autosave command. Twitter allows us to make 180 requests per 15 minutes. That is 12 requests per minute, or one request every five seconds. Just to be safe, I add a sleep command that makes my program sleep for 5 seconds after each iteration.

After that, the code will display the number of tweets the script has collected so far.
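
The loop described above might be sketched like this. It is an illustration under assumptions rather than the original code: the collect function, its fetch callback (standing in for the authorized API request) and the max_requests limit are hypothetical, and the place where the autosave would be triggered is only marked with a comment.

```python
import json
import time


def collect(fetch, max_requests, pause=5):
    """Call fetch() repeatedly and gather the statuses from each JSON reply."""
    data = []
    req_count = 0
    while True:
        if req_count == max_requests:
            break
        req_count += 1
        d = json.loads(fetch())  # JSON text -> Python dictionary
        try:
            statuses = d["statuses"]
        except KeyError:
            # The API returned an error message instead of tweets.
            print("No statuses in reply to request", req_count, ":", d)
            statuses = []
            # A forced save (saveOverride=True) would go here.
        data.extend(statuses)
        print("Tweets collected so far:", len(data))
        time.sleep(pause)  # stay under 180 requests per 15 minutes
    return data, req_count
```

For a dry run without network access, fetch can be any function that returns JSON text, for example `lambda: '{"statuses": []}'`, and pause can be set to 0.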

Finally, it’s time for the major optimization trick. I was testing this script for almost a week and I got the following number of tweets for each min_faves value. I can get a maximum of 100 tweets per request and I want to get as many as possible. Currently, there are not many tweets with higher values of min_faves but we want to account for the times when, perhaps, the company or the keyword is trending. The max value of min_faves can be 999999.

min_faves value    Number of tweets
100,000            1
90,000             1
80,000             1
70,000             2
60,000             3
50,000             6
40,000             6
30,000             12
25,000             12

Therefore, my logic starts with a min_faves value of 60000 and decreases it by 10000 each time until it reaches 10000. But if, say, the keyword is trending and I get 100 tweets while min_faves is 30000, the script increases min_faves to 35000 and fetches the results again. From then on, the step is 5000 instead of 10000. However, if the step would shrink to less than 1000, the script ignores the change and simply goes ahead with subtracting 1000.

Once min_faves is less than or equal to 10000, I decrease it by a fixed interval of 1000.
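
One possible reading of this stepping logic is sketched below. It is my interpretation, not the original code: the next_min_faves function is hypothetical, and I assume that the step is halved whenever a request comes back full, which matches the 10000 to 5000 example above.

```python
def next_min_faves(min_faves, step, n_tweets, max_per_request=100):
    """Return the (min_faves, step) pair to use for the next request."""
    if min_faves <= 10000:
        # At or below 10,000 the interval is fixed at 1,000.
        return min_faves - 1000, 1000
    if n_tweets >= max_per_request:
        half = step // 2
        if half >= 1000:
            # The request was full: go back up by half a step
            # and continue with the smaller step.
            return min_faves + half, half
        # The step would drop below 1,000: ignore it and subtract 1,000.
        return min_faves - 1000, step
    return min_faves - step, step


# Start at 60,000 with a step of 10,000; the request returned 3 tweets.
print(next_min_faves(60000, 10000, 3))
# A full request at 30,000 moves min_faves to 35,000 with a step of 5,000.
print(next_min_faves(30000, 10000, 100))
```

The surrounding loop would call this function after every request and stop once min_faves reaches the lowest value you care about.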

When it finishes, the program lets you know that the work is done by displaying ‘End’.

You can find the entire code on GitHub.

That’s all. In the next Twitter data tutorial, I am going to teach you how to retrieve real-time tweets using the big data tool ‘Flume’. Stay tuned!

About Akshay Nagpal

Akshay is a big data developer at Nokia R&D.
