Monday, January 18, 2021

Processing data from a large data grab

I've downloaded a large (>75 GB) data grab from archive.org containing most or all of the tweets from June 2020. The archive consists of 31 .tar files, each containing nested folders whose lowest level holds several compressed .json files. I need a way to access the data stored in this archive from my Python application, and I would like to use MongoDB, since its document-based database structure seems well suited to this kind of data. What would be the best way of doing so?

Here is what the archive looks like:

[screenshot: root folder]

[screenshot: inside folders]
Any help would be appreciated.
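One possible approach (a sketch, not a tested solution): stream each .tar member with Python's `tarfile`, decompress the JSON files on the fly, and bulk-insert the parsed documents into MongoDB with `pymongo`, so the 75 GB never has to be extracted to disk or held in memory. This assumes the leaf files are gzip-compressed, newline-delimited JSON (`.json.gz`); the database and collection names (`june2020`, `tweets`) and the batch size are placeholders.

```python
import gzip
import json
import tarfile


def iter_json_docs(tar_path):
    """Yield parsed JSON documents from every *.json.gz member of a tar,
    streaming each member instead of extracting the archive to disk."""
    with tarfile.open(tar_path, "r") as tar:
        for member in tar:
            if not (member.isfile() and member.name.endswith(".json.gz")):
                continue
            fileobj = tar.extractfile(member)
            if fileobj is None:
                continue
            # Assumes newline-delimited JSON inside each gzipped file.
            with gzip.open(fileobj, "rt", encoding="utf-8") as fh:
                for line in fh:
                    line = line.strip()
                    if line:
                        yield json.loads(line)


def load_archive(tar_path, mongo_uri="mongodb://localhost:27017",
                 batch_size=1000):
    """Insert documents in batches so memory use stays bounded.
    Requires pymongo and a running MongoDB instance (assumptions)."""
    from pymongo import MongoClient  # imported lazily; pip install pymongo
    coll = MongoClient(mongo_uri)["june2020"]["tweets"]
    batch = []
    for doc in iter_json_docs(tar_path):
        batch.append(doc)
        if len(batch) >= batch_size:
            coll.insert_many(batch)
            batch.clear()
    if batch:
        coll.insert_many(batch)
```

You would then call `load_archive(path)` once per .tar file. If the files turn out to be bzip2-compressed or single large JSON arrays rather than NDJSON, the decompression and parsing step in `iter_json_docs` would need adjusting.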

https://stackoverflow.com/questions/65784547/processing-data-from-a-large-data-grab January 19, 2021 at 10:00AM
