Crawling Twitter API for specific tweets
I am trying to crawl Twitter for tweets containing specific keywords, which I have put into the list
keywords = ["art", "railway", "neck"]
I am trying to search for these words in a specific location, which I have written as
PLACE_LAT = 29.7604
PLACE_LON = -95.3698
PLACE_RAD = 200
I have then written a function to collect at least 200 tweets, but I know that only 100 can be returned per query, so the results have to be fetched in batches. My code so far is below; however, it did not work.
def retrieve_tweets(api, keyword, batch_count, total_count, latitude, longitude, radius):
    """
    collects tweets using the Twitter search API

    api: Twitter API instance
    keyword: search keyword
    batch_count: maximum number of tweets to collect per each request
    total_count: maximum number of tweets in total
    """
    # the collection of tweets to be returned
    tweets_unfiltered = []
    tweets = []

    # the number of tweets within a single query
    batch_count = str(batch_count)

    '''
    You are required to insert your own code where instructed to perform the first query to Twitter API.
    Hint: revise the practical session on Twitter API on how to perform query to Twitter API.
    '''
    # per the first query, to obtain max_id_str which will be used later to query sub
    resp = api.request('search/tweets', {'q': keywords,
                                         'count': '100',
                                         'lang': 'en',
                                         'result_type': 'recent',
                                         'geocode': '{PLACE_LAT},{PLACE_LONG},{PLACE_RAD}mi'.format(latitude, longitude, radius)})

    # store the tweets in a list
    # check first if there was an error
    if ('errors' in resp.json()):
        errors = resp.json()['errors']
        if (errors[0]['code'] == 88):
            print('Too many attempts to load tweets.')
            print('You need to wait for a few minutes before accessing Twitter API again.')

    if ('statuses' in resp.json()):
        tweets_unfiltered += resp.json()['statuses']
        tweets = [tweet for tweet in tweets_unfiltered if ((tweet['retweeted'] != True) and ('RT @' not in tweet['text']))]

    # find the max_id_str for the next batch
    ids = [tweet['id'] for tweet in tweets_unfiltered]
    max_id_str = str(min(ids))

    # loop until as many tweets as total_count is collected
    number_of_tweets = len(tweets)
    while number_of_tweets < total_count:
        resp = api.request('search/tweets', {'q': keywords,
                                             'count': '50',
                                             'lang': 'en',
                                             'result_type': 'recent',
                                             'max_id': max_id_str,
                                             'geocode': '{PLACE_LAT},{PLACE_LONG},{PLACE_RAD}mi'.format(latitude, longitude, radius)})
        if ('statuses' in resp.json()):
            tweets_unfiltered += resp.json()['statuses']
            tweets = [tweet for tweet in tweets_unfiltered if ((tweet['retweeted'] != True) and ('RT @' not in tweet['text']))]
            ids = [tweet['id'] for tweet in tweets_unfiltered]
            max_id_str = str(min(ids))
            number_of_tweets = len(tweets)
            print("{} tweets are collected for keyword {}. Last tweet created at {}".format(number_of_tweets,
                                                                                            keyword,
                                                                                            tweets[number_of_tweets-1]['created_at']))

    return tweets
I only needed to write code where it said #INSERT YOUR CODE. What changes do I need to make to get this to work?
def retrieve_tweets(api, keyword, batch_count, total_count, latitude, longitude, radius):
    """
    collects tweets using the Twitter search API

    api: Twitter API instance
    keyword: search keyword
    batch_count: maximum number of tweets to collect per each request
    total_count: maximum number of tweets in total
    """
    # the collection of tweets to be returned
    tweets_unfiltered = []
    tweets = []

    # the number of tweets within a single query
    batch_count = str(batch_count)

    '''
    You are required to insert your own code where instructed to perform the first query to Twitter API.
    Hint: revise the practical session on Twitter API on how to perform query to Twitter API.
    '''
    # per the first query, to obtain max_id_str which will be used later to query sub
    resp = api.request('search/tweets', {'q': #INSERT YOUR CODE
                                         'count': #INSERT YOUR CODE
                                         'lang': 'en',
                                         'result_type': 'recent',
                                         'geocode': '{},{},{}mi'.format(latitude, longitude, radius)})

    # store the tweets in a list
    # check first if there was an error
    if ('errors' in resp.json()):
        errors = resp.json()['errors']
        if (errors[0]['code'] == 88):
            print('Too many attempts to load tweets.')
            print('You need to wait for a few minutes before accessing Twitter API again.')

    if ('statuses' in resp.json()):
        tweets_unfiltered += resp.json()['statuses']
        tweets = [tweet for tweet in tweets_unfiltered if ((tweet['retweeted'] != True) and ('RT @' not in tweet['text']))]

    # find the max_id_str for the next batch
    ids = [tweet['id'] for tweet in tweets_unfiltered]
    max_id_str = str(min(ids))

    # loop until as many tweets as total_count is collected
    number_of_tweets = len(tweets)
    while number_of_tweets < total_count:
        resp = api.request('search/tweets', {'q': #INSERT YOUR CODE
                                             'count': #INSERT YOUR CODE
                                             'lang': 'en',
                                             'result_type': #INSERT YOUR CODE
                                             'max_id': max_id_str,
                                             'geocode': #INSERT YOUR CODE
                                             )
        if ('statuses' in resp.json()):
            tweets_unfiltered += resp.json()['statuses']
            tweets = [tweet for tweet in tweets_unfiltered if ((tweet['retweeted'] != True) and ('RT @' not in tweet['text']))]
            ids = [tweet['id'] for tweet in tweets_unfiltered]
            max_id_str = str(min(ids))
            number_of_tweets = len(tweets)
            print("{} tweets are collected for keyword {}. Last tweet created at {}".format(number_of_tweets,
                                                                                            keyword,
                                                                                            tweets[number_of_tweets-1]['created_at']))

    return tweets
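One concrete problem in the attempt above is independent of the Twitter API itself: the geocode value is built with named placeholders ({PLACE_LAT}, {PLACE_LONG}, {PLACE_RAD}) but .format() is given only positional arguments, which raises a KeyError before any request is sent. The first query also passes the global keywords list rather than the function's keyword parameter, and hardcodes the count instead of using batch_count. A minimal sketch of the corrected parameter handling (variable names taken from the question, sample values from the question's PLACE_* constants):

```python
# The geocode bug in isolation: named placeholders require keyword arguments,
# so the original '{PLACE_LAT},{PLACE_LONG},{PLACE_RAD}mi'.format(lat, lon, rad)
# raises KeyError('PLACE_LAT'). Empty {} placeholders consume positional
# arguments in order instead.
latitude, longitude, radius = 29.7604, -95.3698, 200
geocode = '{},{},{}mi'.format(latitude, longitude, radius)
print(geocode)  # 29.7604,-95.3698,200mi

# The query parameters the template appears to expect: the function's own
# keyword and batch_count arguments, not the global keywords list.
keyword, batch_count = 'art', str(100)
params = {'q': keyword,
          'count': batch_count,
          'lang': 'en',
          'result_type': 'recent',
          'geocode': geocode}
```

With the keywords list from the question, one way to stay within the single-keyword signature is to call the function once per keyword and merge the results.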
Solution 1:[1]
What is your question or issue? I didn't see any in your post.

A couple of suggestions... Remove the lang and result_type parameters from your request. Because you are using geocode, you should not expect very many results, since hardly anyone turns location on when they tweet.

Also, rather than using the max_id parameter, you may want to look at the TwitterPager class, which takes care of this for you. Here is an example: https://github.com/geduldig/TwitterAPI/blob/master/examples/page_tweets.py.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution   | Source
-----------|-------
Solution 1 | Jonas