A Beginner’s Beginner Guide to Scraping: Nokogiri Ninjutsu

In this post I will explain some of the fundamental concepts needed to utilize the Ruby gem Nokogiri. Gems in Ruby are comparable to plugins in WordPress: they are small, self-contained bundles of code you can drop into an application to gain additional functionality.
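For instance, the two gems used in this post are typically installed from the command line before being required in your code:

    gem install nokogiri
    gem install httparty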

For more accurate reading, please inject the phrase ‘To my understanding’ at the start of every sentence.

I followed Sam Callender’s wonderful tutorial, which you can find here. You can find the GitHub repo here.

Nokogiri allows you to scrape, or grab, information from a webpage or several webpages. Huh? Imagine you wanted to collect a list of all the restaurants in your neighborhood to use in a web app you are building. Or imagine you wanted to grab all the content from a website and search through it to get a sense of what keywords it is targeting.

Nokogiri is often described as a tool to create an API when there is none.

In the tutorial referenced above we are collecting the first 100 Craigslist posts related to pets. We are taking each title and moving it into an Excel document. You could also move it into your website’s database. The full potential of this gem is still over my head, but I can tell it allows for very creative and fun implementations.

So the URL we are using for the tutorial is the same URL you would see if you went to Craigslist and did a search for pets.

To understand the next step it is also important to realize you can right-click on any website and select “View Page Source” to reveal that site’s HTML. So as a simple example, imagine this:
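    <html>
      <body>
        <div id="content">
          <p>Blah blah blah</p>
          <p>Yatta Yatch</p>
        </div>
      </body>
    </html>

(This is a made-up snippet of page source; a real site’s HTML would obviously be much longer.)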

HTTParty Gem

The tutorial also utilizes another gem called HTTParty. This gem will allow us to grab the entirety of that source code (the HTML) and convert it into a string, aka some plain text we can manipulate.

So up until this point we have our dependencies (the gems we are requiring) and then a line of code that uses HTTParty to visit Craigslist and grab the page source code. So our Ruby code to deploy a web scraper will look like this:
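A minimal sketch of that code (the exact Craigslist URL is just an example and may differ from the one in the tutorial):

    require 'httparty'
    require 'nokogiri'

    # grab the raw page source for a Craigslist pets search (example URL)
    page = HTTParty.get('https://newyork.craigslist.org/search/pet')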

We set the variable page equal to the string we just collected. Using our example HTML to visualize what is happening: our page variable is pointing to something similar* to this:
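    # a hand-typed illustration using the sample HTML from earlier, not real Craigslist output
    page = "<html>\n <body>\n  <div id=\"content\">\n   <p>Blah blah blah</p>\n   <p>Yatta Yatch</p>\n  </div>\n </body>\n</html>"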

*The pattern of backslashes is not accurate. The point is that you receive all of the original code bundled together as plain text.

Nokogiri Object

The next step is to convert the string we are storing in our page variable into a Nokogiri object. You will see below that our page variable is being passed as an argument, and the result of the Nokogiri call is being assigned to a new variable, parse_page. So at this point our updated code is this:
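Continuing the sketch, Nokogiri::HTML takes that string and parses it into a document object:

    # parse the raw HTML into a Nokogiri document we can query
    parse_page = Nokogiri::HTML(page)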

If you run this in the terminal you will see that parse_page returns the original HTML, but as an object that we can operate on with Ruby, which is cool and worth celebrating.
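One quick way to see this for yourself in irb (the exact class name varies a bit between Nokogiri versions):

    parse_page.class
    # => Nokogiri::HTML4::Document (Nokogiri::HTML::Document on older versions)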

Parse

As of late I have been using ‘parse’ as my go-to verb for taking any sort of action over a collection of code. The average person might use this realization and time to Google the word ‘parse’; I prefer to share my dilemma.

Moving on, the next step is to parse our object, which it seems is just iterating over it and putting each desired bit of information into an array. So first let’s take a look at our updated code:
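Here is a sketch of that step, using the selectors and variable names discussed below (the CSS classes come from Craigslist’s markup at the time of the tutorial, so they may have changed since):

    pets_array = []

    # select every post title link and store its text in the array
    parse_page.css('.content').css('.row').css('.hdrlnk').each do |link|
      pets_array.push(link.text)
    end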

So first we have an empty array, pets_array, that we will eventually store everything in. Below that we have our iteration method.

The big thing to note here is that Nokogiri allows us to use the .css method on our new object, which we are pointing to with the variable parse_page. So in the code, the series of .css methods that take CSS selectors as arguments looks like this:

.css('.content').css('.row').css('.hdrlnk')

This is crazy but slightly informed guessing, but I think what is going on here is equivalent to saying this in CSS:

.content .row .hdrlnk

These CSS selectors are specific to Craigslist.com. As an example of what is happening, we will use our sample HTML from earlier.

If we wanted to grab the phrases “Blah blah blah” and “Yatta Yatch”, our CSS path could be:

#content p
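Here is a quick sketch of that same idea in Ruby, run against the little sample page instead of Craigslist:

    sample = Nokogiri::HTML('<div id="content"><p>Blah blah blah</p><p>Yatta Yatch</p></div>')
    sample.css('#content p').map(&:text)
    # => ["Blah blah blah", "Yatta Yatch"]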

Typically we use CSS to style a webpage, but using the language’s syntax we can select specific sections of HTML.

The CSS selectors used in the tutorial grab all the post titles from Craigslist. The next bit of code converts each of those links into plain text using Ruby’s .text method, which you can see inside the iteration block.

Lastly we push each string of text, each headline in this case, into our newly created empty array pets_array.

In this particular tutorial the array is then written out to a .csv file, which can be opened in Excel. The updated and final version of the code would be this:
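Putting the whole sketch together (the pets.csv filename is my own choice, and the Craigslist URL is still just an example):

    require 'httparty'
    require 'nokogiri'
    require 'csv'

    # grab the page source and parse it into a Nokogiri document
    page = HTTParty.get('https://newyork.craigslist.org/search/pet')
    parse_page = Nokogiri::HTML(page)

    # collect the text of every post title
    pets_array = []
    parse_page.css('.content').css('.row').css('.hdrlnk').each do |link|
      pets_array.push(link.text)
    end

    # write the titles out to a CSV file, one title per row
    CSV.open('pets.csv', 'w') do |csv|
      pets_array.each { |title| csv << [title] }
    end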

A CSV is a handy way to look at the data or manipulate it.

Once the data is in a Ruby array, though, we could do anything with it: send it off to our database and build an application that tracks all things Craigslist pets.

Nokogiri Name

Nokogiri is a type of Japanese saw with teeth that face inward… you see… saw, scraping, it all comes together.


This post could have possibly been more informative if I was not more or less responsible for giving the world the Kyary gif(t) found above. You are welcome.

Resources & Thanks

Distilled
Engine Yard
Ruby B-word Book
My Brain