Strip whitespace from a document in elementary Python scraping

Simple Python scraper

I’ve been using Python to extract text from text files that I created from PDF documents using the open-source PDFToText.

strip() does not strip all whitespace

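Python has a nifty string method, strip(), which strips off leading and trailing whitespace. Its siblings rstrip() and lstrip() do the same for the right-hand or left-hand end only. For example, to clean up a line containing the phrase that we seek, we simply write

line = line.strip()   # both ends
line = line.rstrip()  # trailing whitespace only
line = line.lstrip()  # leading whitespace only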
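As a quick illustration of the three methods, here is a minimal sketch on a made-up line (the sample text is an assumption, not taken from my documents). Printing with repr() makes the remaining whitespace visible.

```python
# A made-up sample line with ordinary whitespace at both ends
line = "   the phrase we seek \t\n"

both_ends = line.strip()    # removes leading and trailing whitespace
left_only = line.lstrip()   # removes leading whitespace only
right_only = line.rstrip()  # removes trailing whitespace only

# repr() shows exactly which whitespace characters are left
print(repr(both_ends))
print(repr(left_only))
print(repr(right_only))
```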

Easy peasy ~ except that my phrases were being returned with a long run of whitespace to the right.

The whitespace is not visible to the naked eye, of course. I deduced its presence from output that showed the word, then a long blank, then the closing quotes.
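One way to see what the blank really consists of is to print the last few characters as escapes together with their Unicode names. The sketch below assumes Python 3; describe_trailing is a hypothetical helper name, and the sample string is invented for illustration.

```python
import unicodedata

def describe_trailing(text, count=5):
    """Return (escaped character, Unicode name) pairs for the last `count` characters."""
    return [(ascii(ch), unicodedata.name(ch, "UNKNOWN")) for ch in text[-count:]]

# Invented sample: a word followed by two invisible characters
sample = "word\xa0\xa0"

for escaped, name in describe_trailing(sample, 3):
    print(escaped, name)
```

Anything that is not an ordinary space shows up with its own name, which saves a lot of guessing.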

NBSP will not be stripped by strip()

My own diagnosis took me up some blind alleys. In a thoroughly confused state, I sought help on StackOverflow. While I slept, a helpful chap in Australia read me the riot act on confusion and told me what was likely to be my problem.

The whitespace wasn’t a space after all. It was an NBSP, a non-breaking space. That is, a character that looks like a space but does the opposite of forcing a line break: it prevents the line from being broken at that point.

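Knowing this, all I had to do now was search for the NBSP using its character code, hexadecimal A0, which Python writes as \xA0 (and that 0 is a zero). Strictly speaking this is not an ASCII code, because ASCII stops at 7F. A0 is the value of the NBSP in Latin-1 and in Unicode.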
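The sketch below, which assumes Python 3, shows the code point of the NBSP and how it is encoded. It also shows a wrinkle worth knowing: in Python 3, strip() on a text string does treat the NBSP as whitespace, while strip() on a byte string only removes ASCII whitespace. The behaviour described in this post is consistent with lines held as byte strings, which is an assumption on my part about the original setup.

```python
NBSP = "\xa0"  # the same character as "\u00a0"

code_point = ord(NBSP)            # the character code as an integer
latin1 = NBSP.encode("latin-1")   # a single byte in Latin-1
utf8 = NBSP.encode("utf-8")       # two bytes in UTF-8

text_line = "phrase" + NBSP * 3          # a text (str) line
byte_line = text_line.encode("latin-1")  # the same line as bytes

print(hex(code_point), latin1, utf8)
print(ascii(text_line.strip()))  # str.strip() removes the NBSPs
print(byte_line.strip())         # bytes.strip() leaves them in place
```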

Simple Python code to find and strip NBSP

So this is what I did:

I compiled a search pattern and used it to cut the line off at the first NBSP:

import re

snap = re.compile(r"""
    (\xA0)  # searching for NBSP that shows up as white space but doesn't leave with strip()
    """, re.X)  # re.X allows this verbose layout with comments

matchObj = snap.search(line)
if matchObj:
    # Discard the line from the point where matchObj starts
    line = line[:matchObj.start()]

Clean NBSP from text files using Python

Hey presto – my line is cleaned up and the offending NBSPs have gone.
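Cutting the line at the first NBSP also throws away any text that follows it. Where that matters, a possible alternative is to turn each NBSP into an ordinary space and then strip as usual. This is a sketch under the assumption of Python 3 text strings, not the method I used above; clean_line and clean_lines are hypothetical helper names.

```python
NBSP = "\xa0"

def clean_line(line):
    """Replace each NBSP with an ordinary space, then strip both ends."""
    return line.replace(NBSP, " ").strip()

def clean_lines(lines):
    """Yield cleaned, non-empty lines from any iterable of strings, such as an open file."""
    for line in lines:
        cleaned = clean_line(line)
        if cleaned:
            yield cleaned

# Invented sample lines standing in for the contents of a text file
sample_lines = ["  a phrase\xa0\xa0\xa0\n", "\xa0\n", "10\xa0km away\xa0\n"]
print(list(clean_lines(sample_lines)))
```

The same generator can be fed an open file object, since iterating over a file yields its lines one at a time.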
