I'm collecting Twitter data (tweets plus metadata) in a MongoDB server, and now I want to do some statistical analysis on it. To get the data from MongoDB into a pandas DataFrame, I used the following code:
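import pandas as pd

# `collection` is my existing MongoDB collection handle; the projection keeps 'id' and 'text' ('_id' comes along by default)
cursor = collection.find({}, {'id': 1, 'text': 1})
tweet_fields = ['id', 'text']
result = pd.DataFrame(list(cursor), columns=tweet_fields)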
This way I successfully loaded the data into pandas, which is great. Now I want to analyze the users who created the tweets, which is data I also collected. It is located in a nested part of the JSON (I'm not 100% sure this is true JSON), for instance user.id, which is the ID of the Twitter user account.
I can add that field to the projection using dot notation:
cursor = collection.find({},{'id': 1, 'text': 1, 'user.id': 1})
But this results in NaN for that column in the DataFrame. I found that the problem lies in how the returned data is structured:
Part of the cursor output without user.id:
[{'_id': ObjectId('561547ae5371c0637f57769e'),
'id': 651795711403683840,
'text': 'Video: Zuuuu gut! Caro Korneli besucht für extra 3 Pegida Via KFMW http://t.co/BJX5GKrp7s'},
{'_id': ObjectId('561547bf5371c0637f5776ac'),
'id': 651795781557583872,
'text': 'Iets voor werkloze xenofobe PVV-ers, (en dat zijn waarschijnlijk wel de meeste).........Ze zoeken bij Frontex een paar honderd grenswachten.'},
{'_id': ObjectId('561547ab5371c0637f57769c'),
'id': 651795699881889792,
'text': 'RT @ansichtssache47: Geht gef?lligst arbeiten, die #Flüchtlinge haben Hunger! http://t.co/QxUYfFjZB5 #grenzendicht #rente #ZivilerUngehorsa…'}]
Part of the cursor output with user.id:
[{'_id': ObjectId('561547ae5371c0637f57769e'),
'id': 651795711403683840,
'text': 'Video: Zuuuu gut! Caro Korneli besucht für extra 3 Pegida Via KFMW http://t.co/BJX5GKrp7s',
'user': {'id': 223528499}},
{'_id': ObjectId('561547bf5371c0637f5776ac'),
'id': 651795781557583872,
'text': 'Iets voor werkloze xenofobe PVV-ers, (en dat zijn waarschijnlijk wel de meeste).........Ze zoeken bij Frontex een paar honderd grenswachten.',
'user': {'id': 3544739837}}]
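To make the NaN reproducible without a MongoDB server, here is a minimal sketch. It assumes the records look like the output above (I dropped '_id' and replaced the tweet text with placeholders) and that 'user.id' was added to the column list. The variable names `records` and `raw` are only for illustration.

```python
import pandas as pd

# Records shaped like the cursor output above; '_id' dropped and text replaced by placeholders
records = [
    {'id': 651795711403683840, 'text': 'first tweet', 'user': {'id': 223528499}},
    {'id': 651795781557583872, 'text': 'second tweet', 'user': {'id': 3544739837}},
]

# Asking for a column named 'user.id' finds no such key, because the key is 'user'
tweet_fields = ['id', 'text', 'user.id']
result = pd.DataFrame(records, columns=tweet_fields)
print(result['user.id'].isna().all())

# Without the column list, the nested dict ends up as a single 'user' column of dicts
raw = pd.DataFrame(records)
print(raw['user'].tolist())
```

So the dot notation works in the MongoDB projection, but pandas sees a top-level key 'user' whose value is a dict, not a key called 'user.id'.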
In short, I don't understand how to get the nested part of my collected data into a separate column of my pandas DataFrame.
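One possible direction, as a sketch rather than a confirmed solution: pandas has json_normalize (available as pd.json_normalize since pandas 1.0), which flattens nested dicts into dotted column names. The records below are stand-ins for list(cursor), with placeholder text, and the name `flat` is only for illustration. I am assuming the cursor yields plain dicts shaped like the output above.

```python
import pandas as pd

# Stand-in for list(cursor); with a real cursor this would be pd.json_normalize(list(cursor))
records = [
    {'id': 651795711403683840, 'text': 'first tweet', 'user': {'id': 223528499}},
    {'id': 651795781557583872, 'text': 'second tweet', 'user': {'id': 3544739837}},
]

# Nested keys are flattened using '.' as the separator, so 'user' -> 'user.id'
flat = pd.json_normalize(records)

# Keep only the columns of interest
result = flat[['id', 'text', 'user.id']]
print(result)
```

With the real data, the '_id' field would presumably show up as an extra column unless it is excluded in the projection with '_id': 0.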