
Strip whitespace from a document in elementary Python scraping

Photo: Tsingy 010 by Olivier Lejade via Flickr

Simple Python scraper

I’ve been using Python to extract text from text files that I created from PDF documents using the open-source PDFToText.

strip() does not strip all whitespace

There is a nifty Python string method, strip(), which removes leading and trailing whitespace. For example, to clean up a line containing the phrase that we seek, we simply write

line = line.strip()

or, to trim only the right-hand or the left-hand end,

line = line.rstrip()

line = line.lstrip()
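As a quick illustration of how the three methods differ, here is a minimal sketch. The sample line is made up for the example and is padded with ordinary spaces, a tab and a newline only.

```python
# Hypothetical sample line padded with ordinary spaces, a tab and a newline.
line = "  \tTotal revenue 2011 \n"

both = line.strip()    # removes whitespace from both ends
right = line.rstrip()  # removes trailing whitespace only
left = line.lstrip()   # removes leading whitespace only

# repr() makes the remaining whitespace visible.
print(repr(both))
print(repr(right))
print(repr(left))
```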

Easy peasy, except that my phrases were being returned with a long run of whitespace on the right.

The whitespace is not visible to the naked eye, of course. I deduced its presence from output that looked like this: word, long blank, closing quote.
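In hindsight, a quick way to see what such a blank run is made of is to look at the repr() of the line rather than printing it. The sketch below assumes the line was read as raw bytes, and the sample content is invented for illustration.

```python
# Hypothetical line as raw bytes: a phrase followed by a run of 0xA0 bytes.
line = b"closing balance\xa0\xa0\xa0"

# On a byte string, strip() with no argument removes ASCII whitespace only.
stripped = line.strip()

# Printing the text would show the tail as blank space;
# repr() shows every byte explicitly.
print(repr(stripped))

# Collect the distinct non-ASCII byte values still present in the line.
suspects = sorted({byte for byte in stripped if byte > 0x7F})
print([hex(byte) for byte in suspects])
```

Any value that turns up in the list is a candidate for the "whitespace" that strip() left behind.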

NBSP will not be stripped by strip()

My own diagnosis took me up some blind alleys.  In a thoroughly confused state, I sought help on StackOverflow.  While I slept, a helpful chap in Australia read me the riot act on confusion and told me what was likely to be my problem.

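The whitespace wasn’t an ordinary space after all. It was an NBSP, a non-breaking space: a character that does the opposite of forcing a line break, because it prevents a break at that point. When the line is handled as a byte string, strip() removes only ASCII whitespace, which would explain why these characters were left behind.

Knowing this, all I had to do now was search for the NBSP using its hexadecimal code, written in Python as the escape “\xa0” (and that 0 is a zero). Strictly speaking this is not an ASCII code: NBSP is byte 0xA0 in Latin-1 and code point U+00A0 in Unicode.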
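To make the character concrete, here is a small sketch using Python 3 and the standard unicodedata module. It also shows that the behaviour of strip() depends on the string type: text strings and byte strings treat NBSP differently. The variable names are just for this example.

```python
import unicodedata

nbsp = "\xa0"  # the same character can also be written "\u00a0"

# Official Unicode name of the character.
name = unicodedata.name(nbsp)
print(name)

# Its code point lies outside the 7-bit ASCII range (0 to 127).
print(ord(nbsp), ord(nbsp) > 127)

# On a Python 3 text string, strip() treats NBSP as whitespace.
text_result = ("phrase" + nbsp).strip()

# On a byte string only ASCII whitespace is removed, so the 0xA0 byte stays.
bytes_result = b"phrase\xa0".strip()

print(repr(text_result))
print(repr(bytes_result))
```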

Simple python code to find and strip NBSP

So this is what I did:

I compiled a search pattern and used it to cut the line off at the first NBSP:

import re

snap = re.compile(r"""
    (\xa0)    # NBSP: shows up as white space but doesn't leave with strip()
    """, re.X)    # re.X allows this verbose layout with comments

matchObj = snap.search(line)

if matchObj:
    # Discard the line from the point where matchObj starts
    line = line[:matchObj.start()]

Clean NBSP from text files using Python

Hey presto: my line is cleaned up and the offending NBSPs have gone.
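For comparison, the same clean-up can be sketched without a regular expression. This is not the code from the original script. The function names truncate_at_nbsp and strip_nbsp are hypothetical, and the sketch assumes the line is a text string rather than bytes.

```python
NBSP = "\xa0"

def truncate_at_nbsp(line):
    """Keep only the text before the first NBSP, like the search-and-cut approach."""
    head, _, _ = line.partition(NBSP)
    return head

def strip_nbsp(line):
    """Strip ordinary whitespace and NBSP from both ends, leaving the middle intact."""
    return line.strip(" \t\r\n" + NBSP)

# Invented sample: a phrase, a run of NBSPs, then a closing quote.
sample = "net profit" + NBSP * 5 + '"'

print(repr(truncate_at_nbsp(sample)))
print(repr(strip_nbsp("  net profit" + NBSP * 3 + "\n")))
```

The first function throws away everything after the first NBSP, which suits lines where the wanted phrase comes first. The second only trims the ends, so any text after an NBSP in the middle of the line survives.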

 


Published in SOCIAL MEDIA & IT
