Friday, May 7, 2021

Issue processing large JSON request into multiple files

I'm trying to query Elasticsearch, take the JSON response, hash a number of fields within it, and then break that response, which is exceptionally large, into many files for ML processing. Normally I would just request, process, and ship the data over an API, but in this case I am required to write the data to static files.

My attempts so far have led me to believe there are two or three potential solutions. The first is to break the request itself into many pieces by slicing the time range with timedelta and making many smaller requests, though that would likely generate many files. A rough sketch of what I mean follows.
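For reference, here is a minimal sketch of that timedelta idea, with one request (and one output file) per day; the endpoint, index name, and @timestamp field are placeholders rather than my real values:

# sketch: slice one big query into daily requests via timedelta
# (the endpoint, index, and timestamp field below are placeholders)
import json
from datetime import datetime, timedelta

import requests

ES_URL = "http://localhost:9200/my-index/_search"

day = datetime(2021, 5, 1)
end = datetime(2021, 5, 7)
while day < end:
    query = {
        "size": 10000,
        "query": {
            "range": {
                "@timestamp": {
                    "gte": day.isoformat(),
                    "lt": (day + timedelta(days=1)).isoformat(),
                }
            }
        },
    }
    resp = requests.get(ES_URL, json=query)
    # one output file per day of data
    with open("results_{}.json".format(day.date()), "w") as f:
        json.dump(resp.json(), f, indent=4)
    day += timedelta(days=1)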

The second/third is to use what I was told is chunking, which I think would most likely be implemented either via the snippet I put together below (with open, files, and an iterating loop) or via stream=True with requests; I found this question regarding the latter solution: How to get large size JSON files using requests module in Python

I'm not familiar with chunking in requests, so any advice on whether that is the best option would be great. My rough understanding of it is sketched below.
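Based on that linked question, my rough understanding of the stream=True approach looks like this; the URL, chunk size, and per-file limit are placeholders, and I haven't run this against my real cluster:

# sketch: stream the HTTP body with requests and roll over to a new
# file every ~50 MB (all sizes and the URL are placeholders)
import requests

url = "http://localhost:9200/my-index/_search?size=10000"
chunk_size = 1024 * 1024            # read 1 MB of the body at a time
max_file_bytes = 50 * 1024 * 1024   # start a new file after ~50 MB

with requests.get(url, stream=True) as resp:
    resp.raise_for_status()
    file_index = 0
    written = 0
    out = open("chunk{}.json".format(file_index), "wb")
    for chunk in resp.iter_content(chunk_size=chunk_size):
        out.write(chunk)
        written += len(chunk)
        if written >= max_file_bytes:
            out.close()
            file_index += 1
            written = 0
            out = open("chunk{}.json".format(file_index), "wb")
    out.close()

One catch I can already see is that splitting the raw bytes this way means each file is not a valid JSON document on its own, which may matter for the ML step.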

I've been working on this on and off for a while now as priorities have continued to shift, but I would love to close this item out so I can focus on my main projects, so any assistance anyone could provide would be greatly appreciated!

import hashlib
import json

my_dict = resp.json()

# fields whose values get replaced with their MD5 hash before writing out
fields_to_hash = (
    "account_number",
    "cardHolderName",
    "cardHolderAddress",
    "presentation_instrument_id",
    "presentation_instrument_identifier",
)

for hit in my_dict["hits"]["hits"]:
    source = hit["_source"]
    for field in fields_to_hash:
        source[field] = hashlib.md5(source[field].encode()).hexdigest()

# for now the entire hashed response still lands in a single file
with open("listfile0.txt", "w") as file:
    json.dump(my_dict, file, indent=4)
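What I'd eventually like that snippet to do, I think, is write the hashed hits out in fixed-size batches, one JSON file per batch. A minimal sketch of that idea, where the batch size of 500 and the filename pattern are arbitrary placeholders:

# sketch: split the hashed hits into numbered files, batch_size hits per file
batch_size = 500  # placeholder; tune to whatever the ML step needs
hits = my_dict["hits"]["hits"]

for file_number, start in enumerate(range(0, len(hits), batch_size)):
    with open("listfile{}.txt".format(file_number), "w") as f:
        json.dump(hits[start:start + batch_size], f, indent=4)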

################# Begin Snippet #2 #########################

# chunking learning attempt (breaking 1 list or large set of data into many files)

# define list of places
places = ['Berlin', 'Cape Town', 'Sydney', 'Moscow']

x = 0  # number of items written to the current file
y = 0  # current file number
listfile = "listfile{}.txt".format(y)
filehandle = open(listfile, 'w')

for listitem in places:
    filehandle.write('%s\n' % listitem)
    x += 1

    # checking to see if x is iterating by 1
    print(x)

    # checking to see if the filename is the same or iterating
    print(listfile)

    # checking to see if the y value is the same or iterating
    print("this is the current y: {}".format(y))

    # when x reaches 3, close the current file, iterate y by 1 (which
    # changes the filename from listfile0 to listfile1, etc.), reset x
    # to 0, open the next file, and continue until all items are
    # processed; simply reassigning listfile inside a "with open(...)"
    # block would keep writing to the old handle
    if x == 3:
        filehandle.close()
        y += 1
        listfile = "listfile{}.txt".format(y)
        x = 0
        print(y)
        filehandle = open(listfile, 'w')

filehandle.close()
https://stackoverflow.com/questions/67443991/issue-processing-large-json-request-into-multiple-files May 08, 2021 at 12:06PM
