I’ve been using Python to extract text from text files that I created from PDF documents using the open-source PDFToText.
strip() does not strip all whitespace
There is a nifty Python string method, strip(), which removes leading and trailing whitespace. For example, to clean up a line containing the phrase that we seek, we simply write
line = line.strip()
Easy peasy, except that my phrases were being returned with a long run of whitespace to the right.
The whitespace is not visible to the naked eye, of course. I deduced it from the output, which showed the word, then a long blank, then the closing quotes.
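A quick way to make this kind of invisible padding show up is to look at the repr() of the line rather than printing it. This is only a small sketch, and the sample line is made up for illustration; it is not taken from my files.

```python
# Hypothetical sample line: a phrase followed by three non-breaking spaces.
line = "sought phrase\xa0\xa0\xa0"

# print(line) would show the padding as plain blank space.
# repr() writes non-printable characters as escape codes instead,
# so the padding becomes visible as \xa0.
visible = repr(line)
print(visible)
```

If the trailing blanks were ordinary spaces, repr() would show them as spaces between the word and the closing quote.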
NBSP will not be stripped by strip()
My own diagnosis took me up some blind alleys. In a thoroughly confused state, I sought help on StackOverflow. While I slept, a helpful chap in Australia read me the riot act on confusion and told me what was likely to be my problem.
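The whitespace wasn’t a space after all. It was an NBSP, a non-breaking space. That is, a character that does the opposite of forcing a line break: it prevents a line break at that point. It looks like a space, but strip() on a byte string removes only ASCII whitespace, which is presumably why it survived (in Python 3, strip() on a decoded str does count NBSP as whitespace).
Knowing this, all I had to do now was search for NBSP using its character code, hexadecimal A0, written in Python as “\xa0” (and that 0 is zero). Strictly speaking it is not an ASCII code, since ASCII stops at 7F; A0 is the NBSP in Latin-1 and Unicode.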
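To see the difference for yourself, here is a minimal sketch in Python 3. The sample line and the variable names are assumptions for illustration, and I am assuming the file was read as Latin-1 bytes, which may not match your files.

```python
NBSP = "\xa0"  # U+00A0, NO-BREAK SPACE

text_line = "sought phrase" + NBSP * 3    # decoded text (str)
byte_line = text_line.encode("latin-1")   # the same line as raw bytes

# A simple membership test tells us whether the line contains an NBSP.
has_nbsp = NBSP in text_line

# bytes.strip() only knows about ASCII whitespace, so the NBSPs survive.
bytes_stripped = byte_line.strip()

# Naming the characters explicitly removes them from the right-hand end.
bytes_cleaned = byte_line.rstrip(b"\xa0 \t\r\n")

# str.strip() in Python 3 treats NBSP as whitespace and removes it.
text_stripped = text_line.strip()
```

So whether strip() leaves the NBSP behind depends on whether you are holding bytes or decoded text.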
Simple python code to find and strip NBSP
So this is what I did:
I compiled a search term as
import re

snap = re.compile(r"""
    (\xa0)  # searching for NBSP that shows up as white space but doesn't leave with strip()
""", re.X)  # re.X allows this verbose layout with comments
matchObj = snap.search(line)
# Keep only the part of the line before the first NBSP
if matchObj:
    line = line[:matchObj.start()]
Clean NBSP from text files using Python
Hey presto – my line is cleaned up and the offending NBSPs have gone.
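Cutting the line at the first NBSP worked for me because the NBSPs only appeared at the end. If they might also turn up in the middle of a line, a gentler alternative (not what I did above, just a sketch) is to turn each NBSP into an ordinary space and then strip. The helper name clean_line and the sample lines are hypothetical.

```python
def clean_line(line):
    """Turn every NBSP into an ordinary space, then strip both ends."""
    return line.replace("\xa0", " ").strip()

# Hypothetical lines standing in for the contents of a converted text file.
lines = ["Title\xa0\xa0\xa0\n", "  first\xa0second  \n", "plain\n"]
cleaned = [clean_line(l) for l in lines]
print(cleaned)
```

This keeps any text that follows an NBSP, whereas slicing at the first match throws it away.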