Tag: python

Run a Python program on every file in a directory

Use the cmd line to run a python program on every file in a directory

It took me several hours to get the command line right to run a Python program on every file in a directory and send the results to the same output file.

I am jotting down my notes here because it seems that much of the advice on the internet is plain wrong.  This is what worked for me.

Assuming

  • We are working on a regular computer using Windows
  • Python is already installed
  • We have a Python program called myprog.py
  • We have data we want to analyse with that program
  • We have many, many data files in a directory and we want to analyze them one-by-one and put the results in one file.
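  • The program expects one or more data files as command-line arguments, which it reads from sys.argv. That is, the program expects us to tell it where to find the data files on the cmd line using this format.
    • myprog.py datafile1 datafile2 datafile3 >output.txt (inside the program these arrive as sys.argv[1], sys.argv[2] and sys.argv[3])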

If we only have a few data files, or we are only using a few files in a directory, then we don’t need this procedure. It is easier simply to type out the command as shown above.

If we want to run the program on a thousand files, say, it would be tedious to type out the filenames (in place of sys.argv).
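For reference, a program that takes its data files from the cmd line might look like the sketch below. The analyse function is a made-up placeholder that only counts lines; the real myprog.py would do its own analysis there.

```python
import sys


def analyse(path):
    """Placeholder analysis (hypothetical): count the lines in one data file."""
    with open(path) as handle:
        return sum(1 for _ in handle)


def main(argv):
    # argv[0] is the program name; the data files start at argv[1].
    for path in argv[1:]:
        print("%s: %d lines" % (path, analyse(path)))


if __name__ == "__main__":
    main(sys.argv)
```

Everything the program prints goes to the screen unless we redirect it to a file on the cmd line.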

Gist of procedure to use cmd line to run a python program on every file in a directory

To run a Python program on every file in a directory, we use the for command.  Here are the steps in an orderly way.

Steps:

#1 Locate the directories of the python program and the data

For example

Python program physically sits in c:\python27\myprog.  The name of the program is myprog.py.

Data physically sits on a removable drive e:\mydata

#2 Decide where we are going to send the results

I always send my results to the same directory as the program, but that is not necessary.  I do that because there is less chance of a typo if I don’t have to type in the directory name, and it is relatively easy to move a file to another directory manually.

In my example

My results will go to the same directory as the program, c:\python27\myprog

#3 Open the cmd line in Windows

We are going to give commands directly to the Windows command prompt (cmd.exe).  It looks and behaves much like DOS, which you may dimly remember from 1980’s computers, and it still ships with every Windows machine.

  • Go to Start (bottom left – where you normally switch off your computer)
  • Select Run
  • Type cmd<enter> into the box : a black square should pop up.

#4 Set the directory of our program as our working directory

In the “command line window”, you should see something like c:\mydocuments.  Windows machines try to make us work in My Documents.   I change this to my program directory using

cd c:\python27\myprog

That means change directory to c:\python27\myprog

#5 Use the cmd line to tell python to run myprog iteratively with every file in the data directory

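for %f in (e:\mydata\*.txt) do myprog.py %f >>output.txt

Notes

  • Look at the syntax : for  variable in (set) do myprog.py variable >>outputfilename
  • The variable name must be a percentage sign followed by one letter, such as %f or %x or %t.  Everything else throws up an error.  The variable name must be the same in both uses.  (Inside a batch file, double the percentage sign: %%f.)
  • The set in brackets is the directory holding the data plus a pattern for the files we want : (dirname\*.txt)
  • >>outputfile.txt appends each result to a text file in the same directory as our program.  With a single > the file is overwritten on every pass of the loop, so only the result for the last data file survives.  Delete any old output file before starting a new batch.  The name does not have to end in .txt, but a .txt file opens in any text editor.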
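The same loop can also be written in Python itself, which avoids the cmd syntax altogether. This is only a sketch of an alternative, not the procedure described above; the function name run_on_every_file is made up for illustration.

```python
import glob
import subprocess
import sys


def run_on_every_file(program, pattern, output_path):
    """Run `program` once per file matching `pattern`.

    Everything the runs print is collected in one output file.
    Returns the number of data files processed.
    """
    data_files = sorted(glob.glob(pattern))
    with open(output_path, "w") as out:
        for data_file in data_files:
            # Use the same Python interpreter that is running this script
            # and send the child's printed output to the shared file.
            subprocess.call([sys.executable, program, data_file], stdout=out)
    return len(data_files)
```

For the example in this post, the call would be run_on_every_file("myprog.py", r"e:\mydata\*.txt", "output.txt"), run from the program directory.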

Conclusion: Use cmd line to run a python program on all the files in directory

And remember that if the program takes 1 second to run for one file, then for 1800 files, as I am running now, the batch takes 30 minutes.  If the program takes 2 seconds per file, the total batch will take an hour.
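The arithmetic is simple enough to check before starting a long batch. A small sketch, assuming the files are processed one after another and each takes about the same time:

```python
def batch_minutes(n_files, seconds_per_file):
    """Total run time in minutes when files are processed one at a time."""
    return n_files * seconds_per_file / 60.0


print(batch_minutes(1800, 1))  # 30.0
print(batch_minutes(1800, 2))  # 60.0
```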


Strip whitespace from a document in elementary Python scraping

Tsingy 010 by Olivier Lejade via Flickr

Simple python scraper

I’ve been using Python to extract text from text files that I created from PDF documents using the open-source PDFToText.

Strip() does not strip all whitespace

There is a nifty Python method which strips off leading and trailing whitespace.  For example, to clean up a line containing the phrase that we seek, we simply write

line = line.strip()

or

line=line.rstrip()

line=line.lstrip()

Easy peasy ~ except that my phrases were being returned with a long run of whitespace to the right.

The whitespace is not visible to the naked eye of course and I deduced that from the presence of word~long blank~closing quotes.
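One way to make such invisible characters show up is to print the repr() of a line rather than the line itself. A minimal sketch, with a made-up sample line standing in for the real data:

```python
# Hypothetical sample line: a phrase followed by three non-breaking spaces.
line = u"phrase we seek\xa0\xa0\xa0"

# repr() shows characters that are invisible on screen as escape codes,
# so the trailing characters appear as \xa0 instead of as blanks.
print(repr(line))
```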

NBSP will not be stripped by strip()

My own diagnosis took me up some blind alleys.  In a thoroughly confused state, I sought help on StackOverflow.  While I slept, a helpful chap in Australia read me the riot act on confusion and told me what was likely to be my problem.

The whitespace wasn’t a space after all. It is an NBSP, a non-breaking space. That is, a marker that does the opposite of forcing a line break: it prevents a line break at that point, so the words on either side stay together on one line.
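Whether strip() removes an NBSP depends on the kind of string. The python27 directory used earlier suggests Python 2.7, where ordinary strings are byte strings. The sketch below assumes Python 3, where the same behaviour shows up with bytes objects, while text strings treat NBSP as whitespace. The sample values are made up.

```python
raw = b"phrase we seek\xa0\xa0"    # bytes: 0xA0 is not ASCII whitespace
text = u"phrase we seek\xa0\xa0"   # text: U+00A0 counts as whitespace

# On bytes, strip() only removes ASCII whitespace, so the NBSPs stay.
print(raw.strip() == raw)

# On text, strip() removes the NBSPs along with ordinary whitespace.
print(text.strip() == u"phrase we seek")
```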

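Knowing this, all I had to do now was search for NBSP using its character code, written “\xA0” in Python (and that 0 is a zero).  Strictly speaking it is not an ASCII code: ASCII stops at 7F, and A0 comes from the Latin-1 and Unicode range above it.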

Simple python code to find and strip NBSP

So this is what I did:

I compiled a search term as

import re

snap = re.compile(r"""
    (\xA0)      # searching for NBSP that shows up as white space but doesn't leave with strip()
    """, re.X)  # re.X allows this verbose layout with comments

matchObj = snap.search(line)

if matchObj:
    # Discard the rest of the line from the point where the match starts
    line = line[:matchObj.start()]

Clean NBSP from textfiles using Python

Hey presto – my line is cleaned up and the offending NBSP have gone.
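If the text after the NBSP should be kept rather than discarded, a different approach is to swap each NBSP for an ordinary space and then strip as usual. This is a sketch of that alternative, not the regular expression method above; clean_line is a made-up name.

```python
NBSP = u"\xa0"


def clean_line(line):
    """Replace each NBSP with an ordinary space, then strip both ends."""
    return line.replace(NBSP, u" ").strip()


print(repr(clean_line(u"  phrase we seek\xa0\xa0\xa0")))
```

An NBSP in the middle of a line becomes a plain space, so the words on either side are kept.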

 
