I have extended my WebHDFS downloader. For HDFS directories containing an enormous number of files, scanning the directory through the WebHDFS REST API fails with a timeout.
The solution is to obtain the HDFS directory tree with the standard hdfs dfs -ls -R command, ship the result to the node where hdfsdownloader runs, and use that file as the input for the tool.
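For example, assuming the data sits under /user/data and the tool runs on a host called downloadnode (both names illustrative):

hdfs dfs -ls -R /user/data > hdfslist.txt
scp hdfslist.txt downloadnode:/tmp/

The listing is produced on a node with HDFS client access and then copied to the machine running the tool.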
More detailed description: https://github.com/stanislawbartkowski/webhdfsdirectory
To extend the tool, I used Python's inheritance. The tool now runs in two modes: it either gets the list of HDFS files through the WebHDFS REST API or reads it from an input text file. The only difference between the modes is how the list of files is obtained; the main flow of the tool is the same regardless of the method.
Source code: https://github.com/stanislawbartkowski/webhdfsdirectory/blob/main/src/proc/hdfs.py
The CLASSHDFS class runs the application flow. It calls the getdir method to receive the list of names of files and directories in the HDFS path passed as a parameter and iterates over that list. A regular file is downloaded; for a directory entry, the class makes a recursive call and goes one level down the HDFS directory tree.
The getdir method is defined in the appropriate parent class. In WebHDFS REST API mode, the method comes from the TRAVERSEHDFS class; in input text file mode, it is implemented in the TRAVERSEFILE class.
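A stripped-down sketch of how the pieces fit together (illustrative only: the traverse method, the fake listing and the print placeholder are my shorthand here; the real logic lives in the linked hdfs.py):

class CLASSHDFS:
    # Main flow: iterate over whatever getdir (supplied by a sibling
    # base class) returns; recurse into directories, download files.
    def traverse(self, path):
        for name, isdir in self.getdir(path):
            child = path + "/" + name
            if isdir:
                self.traverse(child)        # one level down the tree
            else:
                print("downloading", child) # stands in for the download

class TRAVERSEHDFS:
    # Sketch only: the real class issues WebHDFS REST API calls.
    def getdir(self, path):
        fake = {"/user": [("data", True)],
                "/user/data": [("part-00000", False)]}
        return fake.get(path, [])

class DIRHDFS(CLASSHDFS, TRAVERSEHDFS):
    pass

DIRHDFS().traverse("/user")  # prints: downloading /user/data/part-00000

The actual combining classes from the source look like this: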
class DIRHDFS(CLASSHDFS, TRAVERSEHDFS):

    def __init__(self, WEBHDFSHOST, WEBHDFSPORT, WEBHDFSUSER, DIRREG=None, dryrun=False):
        TRAVERSEHDFS.__init__(self, WEBHDFSHOST, WEBHDFSPORT, WEBHDFSUSER)
        CLASSHDFS.__init__(self, DIRREG, dryrun)


class FILEHDFS(CLASSHDFS, TRAVERSEFILE):

    def __init__(self, txtfile, WEBHDFSHOST, WEBHDFSPORT, WEBHDFSUSER, DIRREG=None, dryrun=False):
        self.WEBHDFSHOST = WEBHDFSHOST
        self.WEBHDFSPORT = WEBHDFSPORT
        self.WEBHDFSUSER = WEBHDFSUSER
        TRAVERSEFILE.__init__(self, txtfile)
        CLASSHDFS.__init__(self, DIRREG, dryrun)
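Creating an instance then selects the mode; only the arguments differ (host, port, user and file names below are illustrative):

# WebHDFS REST API mode
hdfs = DIRHDFS("namenode.example.com", "50070", "hdfs")

# input text file mode, fed with the listing shipped from the cluster
hdfs = FILEHDFS("hdfslist.txt", "namenode.example.com", "50070", "hdfs")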