Help Tinkering with Web Scraping

milesleacy
Valued Contributor

I'm tinkering with web scraping in bash, and I'm getting a result, but it's inelegant and, I suspect, unreliable.

What I'd love to learn how to do is to take the Xth table from a given web page, and place the contents of its rows and cells into a csv file.
Once that's done, I'd want to normalise some data, for example, dumping formatting tags, stripping 'a' tags but retaining the URL, and stripping 'img' tags but retaining the URL to the image.
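For example, I imagine the 'a' and 'img' normalisation looking something like this (an untested sed sketch; the href/src patterns are assumptions about the markup, and it won't handle nested tags):

sed -E 's/<a[^>]*href="([^"]*)"[^>]*>[^<]*<\/a>/\1/g; s/<img[^>]*src="([^"]*)"[^>]*>/\1/g'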

I also want to do this on a clean macOS system, using only the built-in tools.

I could probably get there if I kept beating on it, but I'm reaching out to the great gestalt brain of the mighty Jamf Nation to see if anyone has already done this. Any suggestions are welcome.

Below is the example I am working with. I do not want to focus on the peculiarities of this particular web page, but rather the generic and portable approach described above.

I'm using this...

curl "https://macadmins.software" | sed -n '/Volume License/,$p' | sed -n '/<table*/,/</table>/p' | sed '/</table>/q' | awk '/Standalone|AutoUpdate/' | awk '!/OneDrive/' | awk '{gsub("<tr>|</tr>|<td align="center">|<td>|<a href=[:47]", "");print}' | awk '{gsub("</td>", ",");print}' | cut -f1 -d"'"

To get this CSV...

Word Standalone,15.36.0 [170702] 64-bit,05026bfc430e183211791b475ae48cd8c664f6b3,https://go.microsoft.com/fwlink/?linkid=525134
Excel Standalone,15.36.0 [170702] 64-bit,a352069615f65d0538c51db1cba605bd6b2d560b,https://go.microsoft.com/fwlink/?linkid=525135
PowerPoint Standalone,15.36.0 [170702] 64-bit,bf53a2e41c7f3338baa4ed5d6581b48bb1ab79f6,https://go.microsoft.com/fwlink/?linkid=525136
Outlook Standalone,15.36.0 [170702] 64-bit,c19e48d453f52e2bd3d6f7fe834a7527c7d9a64b,https://go.microsoft.com/fwlink/?linkid=525137
OneNote Standalone,15.36.0 [170702] 64-bit,421b356c19c15eb931c9aa8f8ebb8b8ae40e12df,https://go.microsoft.com/fwlink/?linkid=820886
Microsoft AutoUpdate (MAU),3.9.1 [170702] 64-bit,52f907a9c429c7fe12e4800f64e8c81241155652,https://go.microsoft.com/fwlink/?linkid=830196

From this page...
https://macadmins.software

4 REPLIES

Maxalot
New Contributor III

How are your RegEx chops? That's going to be the best way to scrape a webpage.

mm2270
Legendary Contributor III

@milesleacy You can try this, but beware that it's also pretty fragile, since it relies on Microsoft not reformatting their table at all. As soon as the table changes even a little, this could break, but that would likely be true of almost any code, I think. Webpage tables weren't really meant to be parsed this way, but I understand why you're doing it. I've run into situations where it's the only thing that works for the end goal.

curl -s "https://macadmins.software" | egrep "Standalone|AutoUpdate" | sed '/OneDrive/d;$d' | awk -F'>|<' '{print $5","$9","$13","$18}' | sed "s/a href=//g;s/'//g" > ~/Desktop/Office.csv

For me, this outputs a file on my Desktop that looks like this when opened in Excel:

[screenshot of the resulting CSV opened in Excel]

milesleacy
Valued Contributor

Thanks @mm2270. That's a greatly simplified version of my mess, but it's still particular to the example HTML.

@Maxalot I'm decent at RegEx, and great at googling to expand my understanding.
Could you expand on your suggestion?

Ideally, I want to identify whatever is within any given table, tr, and td tag, with or without attributes like align, font, etc., and dump the contents into CSV.

Someone has to have cracked this proverbial nut already.

brock_walters
Contributor
sed 's/<[^>]*>/,/g' | tr -s ',' | sed 's/^,*//g;s/,$//g'
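For example, fed one hypothetical table row, the effect is:

echo '<tr><td align="center">Word Standalone</td><td>15.36.0</td></tr>' | sed 's/<[^>]*>/,/g' | tr -s ',' | sed 's/^,*//g;s/,$//g'

which prints Word Standalone,15.36.0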

As I said, something like the 1st regex above is, I think, what you're after. It's used below. Here are some other ideas:

If you are doing something like this, in my opinion, you should use

xmllint --html --xpath ...

or an xslt style sheet. Pure bash solutions will cause you to go insane.
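For instance, to grab the rows of the Xth table on a page generically (here the 2nd; the XPath index is 1-based, and /tmp/file.html is just a stand-in path), something like:

xmllint --html --xpath "(//table)[2]//tr" /tmp/file.html 2>/dev/null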

I had to run your output through tidy to get xmllint to like it.

http://infohound.net/tidy/
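macOS also ships a command-line tidy, so something like this should do similar cleanup locally (a sketch; flags vary a bit between tidy versions):

curl -s "https://macadmins.software" | tidy -q -asxml --show-warnings no > /tmp/file.html 2>/dev/null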

Once I did that, I was able to get output somewhat close to what you want with a lot less bash craziness:

$ xmllint --html --xpath "//html/body/table/*[contains(., 'Auto') or contains(., 'Standalone')]" /tmp/file.html \
| grep -v '^\s*$' \
| sed 's/.*href="\(.*\)/\1/;s/<[^>]*>//g;s/^[[:space:]]*//g'

Output:

Word Standalone
15.36.0 [170702] 64-bit
05026bfc430e183211791b475ae48cd8c664f6b3
https://go.microsoft.com/fwlink/?linkid=525134">

Excel Standalone
15.36.0 [170702] 64-bit
a352069615f65d0538c51db1cba605bd6b2d560b
https://go.microsoft.com/fwlink/?linkid=525135">

PowerPoint Standalone
15.36.0 [170702] 64-bit
bf53a2e41c7f3338baa4ed5d6581b48bb1ab79f6
https://go.microsoft.com/fwlink/?linkid=525136">

Outlook Standalone
15.36.0 [170702] 64-bit
c19e48d453f52e2bd3d6f7fe834a7527c7d9a64b
https://go.microsoft.com/fwlink/?linkid=525137">

OneNote Standalone
15.36.0 [170702] 64-bit
421b356c19c15eb931c9aa8f8ebb8b8ae40e12df
https://go.microsoft.com/fwlink/?linkid=820886">

OneDrive Standalone
17.3.6945.0724 (170727) 64-bit
efd0de7f9bd10072a27866bbcfe00b52c861c52f
https://go.microsoft.com/fwlink/?linkid=823060">

Microsoft AutoUpdate (MAU)
3.9.1 [170702] 64-bit
52f907a9c429c7fe12e4800f64e8c81241155652
https://go.microsoft.com/fwlink/?linkid=830196">

That would be fairly easy to get into a .csv file.
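For instance, assuming the output above were saved to /tmp/output.txt (a stand-in name), a paragraph-mode awk could join each blank-line-separated record into one CSV row:

awk 'BEGIN { RS=""; FS="\n"; OFS="," } { $1=$1; print }' /tmp/output.txt > /tmp/new.csv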

I really like this tutorial on xslt:

https://www.w3schools.com/xml/xsl_intro.asp

Style sheets are great because you can match them to unusual formats you may have to transform again & save them to use over & over.
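As a sketch of what such a style sheet might look like (this assumes tidied, non-namespaced HTML input and targets the first table on the page; adjust the XPath to pick the Xth table):

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- emit plain text rather than XML -->
  <xsl:output method="text"/>
  <xsl:template match="/">
    <!-- one CSV row per tr; change [1] to pick a different table -->
    <xsl:for-each select="(//table)[1]//tr">
      <xsl:for-each select="td">
        <xsl:value-of select="normalize-space(.)"/>
        <xsl:if test="position() != last()">,</xsl:if>
      </xsl:for-each>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>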

To use, run a command like the following:

xsltproc -o /private/tmp/new.csv /private/tmp/style.xslt /private/tmp/file.html

where the .csv is the output file created, the .xslt is the style sheet with the transformation rules (they are not that easy to learn but powerful once you do...) & the .html is the input data.

I hope this is helpful to you & others looking to do this kind of thing.