
Yo’ dawg, I heard you like archives, so I put some tar.gz in your tar so you can extract archives from your archives!

This isn’t incredibly mindblowing, but I was tickled when I discovered today that I could untar a file in memory, and then extract the gzip archives within, all still in memory! Here is an example. The variables in the script include:

input_tar is a tar file path that contains a single folder of .tar.gz files from arXiv. For example, “arxiv/1306.tar” has one subfolder, “1306”, and this corresponds to a bunch of papers from June (06) 2013 (13). The scrappy script expects this path to be the first argument (if you are writing a real script, please use argparse)!

Next is the “input_tar” once it’s read into memory with tarfile.

Each of the “*.tar.gz” (tarinfo) objects inside it is a member, and corresponds to a compressed paper (e.g., “1306/”). I suspect that when a user uploads their paper, it gets assigned this unique id that includes the month, year, and then the paper number (5867). The reason we check for the extension and that it’s a file is that the top-level folder is also a tarinfo object, and we would get an error if we treated it like a file.

subtar is a member (.tar.gz) that is read into a second tar object, but this one with mode r|gz because it’s gzipped. In this subtar we expect to find the files of interest that we want to parse.

The recursive part lives in a helper, yodawg_helper ( tar, member, extension ). It starts a list called extracted, asks “Do we have gzip and not tar?” (if the name ends with 'tar.gz', then mode = 'r|gz'), and opens the member as a subtar with tarfile. Then, for each submember in the subtar, there are two cases. Case 1 is another tar or tar.gz, where we add the result of yodawg_helper ( subtar, submember, extension ) to extracted. Otherwise, if the submember’s name ends with the extension, we call extract_member ( subtar, submember ) and append the result (tex) to extracted. At the end we return extracted.

And of course, at the end, you would want to do something with the contents that you read. I want to note that I didn’t test the script. I wrote it while I was making dinner (and if you test it and have improvements, please comment and contribute!).

Why did I want to write this? Maybe I just really like data structures, or working with file objects, or just playing with Python, but I think this is some really cool beans.

Keep memory in mind as you parse through archives. For example, I was doing something similar with containers, and for larger ones my computer would slow and then poop out.

Notice we use ‘r’ for read, and ‘gz’ when it’s a gzipped file. If you look in the Resources link below, you’ll notice different format strings to specify different kinds of compression. Your tar is probably not like my tar, so give a glance here first.
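To make the ‘r’ and ‘gz’ bit concrete, here is a tiny sketch of picking a mode string from a file name. The names stream_mode_for and names_in are made up for illustration and are not from the original script. The mode strings themselves are real tarfile modes: the part before the separator is ‘r’ for read, the part after is the compression, and ‘|’ (instead of ‘:’) asks for a forward-only stream.

```python
import io
import tarfile

# Hypothetical lookup: file name suffixes -> tarfile stream mode
SUFFIX_TO_MODE = [
    ((".tar.gz", ".tgz"), "r|gz"),   # gzip
    ((".tar.bz2",), "r|bz2"),        # bzip2
    ((".tar.xz",), "r|xz"),          # lzma / xz
    ((".tar",), "r|"),               # plain, uncompressed tar
]


def stream_mode_for(name):
    """Guess a stream mode from the file name (an assumption, not a check)."""
    for suffixes, mode in SUFFIX_TO_MODE:
        if name.endswith(suffixes):
            return mode
    return "r|*"  # unknown suffix: let tarfile detect the compression


def names_in(data, name):
    """List the member names of an archive that is already in memory."""
    with tarfile.open(fileobj=io.BytesIO(data), mode=stream_mode_for(name)) as tar:
        return [member.name for member in tar]
```

Guessing from the suffix is only a convenience. If the name lies about the contents, tarfile raises a ReadError, which is a decent hint to fall back to ‘r|*’.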
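The recursive helper described above could look roughly like the sketch below. The names yodawg_helper and extract_member come from the post, but the bodies are my reconstruction under assumptions: is_archive is a made-up helper, I assume the files of interest are text that decodes as UTF-8, and I use ‘r|’ for a nested plain tar.

```python
import tarfile


def is_archive(name):
    """Hypothetical check: does this member look like another archive?"""
    return name.endswith((".tar", ".tar.gz"))


def extract_member(tar, member):
    """Read one regular file out of an open tar and decode it as text."""
    return tar.extractfile(member).read().decode("utf-8", errors="replace")


def yodawg_helper(tar, member, extension):
    """Open `member` of `tar` as a nested archive and collect matching files."""
    extracted = []
    # Do we have gzip and not tar? Then stream it with 'r|gz', else 'r|'
    mode = "r|gz" if member.name.endswith(".tar.gz") else "r|"
    with tarfile.open(fileobj=tar.extractfile(member), mode=mode) as subtar:
        for submember in subtar:
            if not submember.isfile():
                continue  # folders are TarInfo objects too, with no data
            if is_archive(submember.name):
                # Case 1: another tar or tar.gz, so go one level deeper
                extracted += yodawg_helper(subtar, submember, extension)
            elif submember.name.endswith(extension):
                # Case 2: a file we care about
                extracted.append(extract_member(subtar, submember))
    return extracted
```

Because the nested archives are opened as streams, each submember has to be read before moving on to the next one, which is exactly what the loop does.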
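Two of the notes above (use argparse for a real script, and keep memory in mind) fit in one small sketch. This is not the original script: iter_papers, main, and the --extension flag are names I invented, and the layout (one tar holding a folder of .tar.gz papers) is assumed from the description of input_tar. The generator hands back one file at a time instead of building a big list.

```python
import argparse
import tarfile


def iter_papers(input_tar, extension=".tex"):
    """Yield (paper name, file name, raw bytes), one file at a time."""
    with tarfile.open(input_tar, mode="r") as tar:
        for member in tar:
            # The top-level folder is a TarInfo too, but it is not a file
            if not (member.isfile() and member.name.endswith(".tar.gz")):
                continue
            fileobj = tar.extractfile(member)
            with tarfile.open(fileobj=fileobj, mode="r|gz") as subtar:
                for submember in subtar:
                    if submember.isfile() and submember.name.endswith(extension):
                        data = subtar.extractfile(submember).read()
                        yield member.name, submember.name, data


def main(argv=None):
    parser = argparse.ArgumentParser(
        description="Read files out of a tar of .tar.gz papers, in memory")
    parser.add_argument("input_tar", help="path such as arxiv/1306.tar")
    parser.add_argument("--extension", default=".tex",
                        help="file suffix to pull out of each paper")
    args = parser.parse_args(argv)
    for paper, name, data in iter_papers(args.input_tar, args.extension):
        print(paper, name, len(data))
```

Calling main(["arxiv/1306.tar"]) would print one line per matching file. Only one file's bytes are held at a time, so a larger archive should be gentler on memory than collecting everything first.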
