I'm trying to make a request to Elasticsearch, get the result back as a JSON response, hash a number of fields within that response, and then break that response, which is exceptionally large, into many files for ML processing. Normally I would just request, process, and ship the data on to another API, but here I am required to write the data to static files.
My attempts so far have led me to believe there are two or three potential solutions. The first is to break the request itself into many pieces via timedelta, i.e. make many smaller time-bounded requests, but that would likely generate many files on its own.
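For reference, this is roughly how I picture that first option — one time-bounded request per window, each dumped to its own file (the endpoint, index name, @timestamp field, and one-hour window below are placeholders, not my real values):

    import json
    from datetime import datetime, timedelta

    import requests

    ES_URL = "http://localhost:9200/my-index/_search"  # placeholder endpoint/index
    WINDOW = timedelta(hours=1)                         # placeholder window size

    start = datetime(2021, 5, 1)
    end = datetime(2021, 5, 2)

    part = 0
    current = start
    while current < end:
        query = {
            "size": 10000,
            "query": {
                "range": {
                    "@timestamp": {  # assumed timestamp field name
                        "gte": current.isoformat(),
                        "lt": (current + WINDOW).isoformat(),
                    }
                }
            },
        }
        resp = requests.post(ES_URL, json=query)
        resp.raise_for_status()
        # one file per time window
        with open("listfile{}.txt".format(part), "w") as f:
            json.dump(resp.json(), f, indent=4)
        part += 1
        current += WINDOW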
The second/third option is to use what I was told is chunking, which I think would most likely be implemented either via the snippet I put together below (with open, writing to files in an iterating loop), or by using stream=True with requests, for which I found this regarding that solution: How to get large size JSON files using requests module in Python
I'm not familiar with using chunking with requests, so any advice on whether that is the best option would be great.
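To show where my understanding currently is, this is roughly how I picture the stream=True option — streaming the raw response body to disk in fixed-size chunks instead of holding the whole JSON in memory (the URL, query, and chunk size below are placeholders, not my real values):

    import requests

    ES_URL = "http://localhost:9200/my-index/_search"   # placeholder URL
    query = {"size": 10000, "query": {"match_all": {}}}  # placeholder query

    # stream=True defers downloading the body until we iterate over it
    with requests.post(ES_URL, json=query, stream=True) as resp:
        resp.raise_for_status()
        with open("raw_response.json", "wb") as f:
            for chunk in resp.iter_content(chunk_size=1024 * 1024):  # 1 MB chunks
                f.write(chunk)

As far as I can tell, though, this only avoids holding the HTTP body in memory while downloading; I would still need to parse the saved file afterwards to hash the fields and split it into smaller files.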
I've been working on this on and off for a while now as priorities have continued to shift, but I would love to close this item out so I can focus on my main projects, so any assistance anyone could provide would be great!
################# Begin Snippet #1 #########################

    import hashlib
    import json

    my_dict = resp.json()

    x = 0
    y = 0
    listfile = "listfile{}.txt".format(y)

    with open(listfile, "w+") as file:
        for hit in my_dict["hits"]["hits"]:
            source = hit["_source"]

            # ACCOUNT NUMBER
            source["account_number"] = hashlib.md5(source["account_number"].encode()).hexdigest()

            # CARDHOLDER NAME
            source["cardHolderName"] = hashlib.md5(source["cardHolderName"].encode()).hexdigest()

            # CARDHOLDER ADDRESS
            source["cardHolderAddress"] = hashlib.md5(source["cardHolderAddress"].encode()).hexdigest()

            # PRESENTATION INSTRUMENT ID
            source["presentation_instrument_id"] = hashlib.md5(source["presentation_instrument_id"].encode()).hexdigest()

            # PRESENTATION INSTRUMENT IDENTIFIER
            source["presentation_instrument_identifier"] = hashlib.md5(source["presentation_instrument_identifier"].encode()).hexdigest()

        # write the whole modified response back out as one JSON file
        file.seek(0)
        json.dump(my_dict, file, indent=4)
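Side note: I realize the repeated hash blocks above could probably be collapsed into a loop over the field names, something like this (field list copied from the snippet, assuming every hit actually contains all of these keys):

    import hashlib

    FIELDS_TO_HASH = [
        "account_number",
        "cardHolderName",
        "cardHolderAddress",
        "presentation_instrument_id",
        "presentation_instrument_identifier",
    ]

    for hit in my_dict["hits"]["hits"]:
        source = hit["_source"]
        for field in FIELDS_TO_HASH:
            # replace each sensitive value with its MD5 hex digest
            source[field] = hashlib.md5(source[field].encode()).hexdigest()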
################# Begin Snippet #2 #########################
    # chunking learning attempt (breaking 1 list or large set of data into many files)
    # define list of places
    places = ['Berlin', 'Cape Town', 'Sydney', 'Moscow']
    x = 0
    y = 0
    listfile = "listfile{}.txt".format(y)
    with open(listfile, 'w') as filehandle:
        for listitem in places:
            filehandle.write('%s\n' % listitem)
            x += 1
            # checking to see if x is iterating by 1
            print(x)
            # checking to see if the filename is the same or iterating
            print(listfile)
            # checking to see if the y value is the same or iterating
            print("this is the current y: {}".format(y))
            # if x is equal to 3, iterate y by 1, which should change the filename
            # from listfile0 to listfile1, etc., reset x back to 0, and continue
            # the loop until all items are processed
            if x == 3:
                y += 1
                listfile = "listfile{}.txt".format(y)
                x = 0
                print(y)
                continue
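One thing I noticed while putting these together: in snippet #2 the with open(...) handle is created once before the loop, so reassigning listfile inside the loop never actually switches output to a new file. My rough idea for combining the two snippets — hash the fields, then write the hits out in fixed-size chunks to numbered files — looks like this (it assumes resp already holds the full response in memory, and the chunk size of 1000 and the file naming are arbitrary placeholders):

    import hashlib
    import json

    FIELDS_TO_HASH = [
        "account_number",
        "cardHolderName",
        "cardHolderAddress",
        "presentation_instrument_id",
        "presentation_instrument_identifier",
    ]
    CHUNK_SIZE = 1000  # hits per output file; arbitrary placeholder

    hits = resp.json()["hits"]["hits"]

    # hash the sensitive fields in place
    for hit in hits:
        source = hit["_source"]
        for field in FIELDS_TO_HASH:
            source[field] = hashlib.md5(source[field].encode()).hexdigest()

    # write the hits out in chunks of CHUNK_SIZE, one numbered file per chunk
    for file_number, start in enumerate(range(0, len(hits), CHUNK_SIZE)):
        chunk = hits[start:start + CHUNK_SIZE]
        with open("listfile{}.json".format(file_number), "w") as f:
            json.dump(chunk, f, indent=4)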