Ruby Regex Using =~ to Parse Data

I recently had to parse through a large dataset, millions of lines. When you have to parse at this scale regex should be the word buzzing around in your head.

Regex, or regular expressions, are a tool used throughout programming languages, in Google Analytics, I would guess Microsoft Excel, etc. Regex allows you to declare general rules for how a string should or should not be.

How a String Should or Should Not Be

Let’s take an example here:

SA0228 PIS R1 JS1 WP4

So here we have a string that has some (not entirely accurate) information about Saied Abbasi. The string includes 5 spaced ‘words’, or smaller strings within it. Let’s pretend the words correspond to these values:

SA0228 = Saied Abbasi’s initials and his birthday of February 28th
PIS = the zodiac sign Pisces
R1 = Ruby 1 year experience
JS1 = JavaScript 1 year experience
WP4 = WordPress 4 years experience

Imagine you have millions of records like this for different individuals, and you want the initials of each individual and how many years of Ruby experience they have.

You would need to write some logic to move through each complete string. Then you would need the internal ‘words’ in an array. So taking a single line with our string example and creating an array:
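A minimal sketch of that step, assuming the example line is the five space-separated words described above. The helper name split_record is made up for illustration.

```ruby
# Split one record into its space-separated 'words'.
# split_record is a hypothetical helper name, not from the original post.
def split_record(line)
  line.split(" ")
end

words = split_record("SA0228 PIS R1 JS1 WP4")
p words   # prints ["SA0228", "PIS", "R1", "JS1", "WP4"]
```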

If we only care about an individual’s initials we could combine Ruby’s gsub method and some regex — a powerful combination!

So if our dataset is kind enough to always include initials followed by a four-digit birthday, and we only care about retrieving the initials, we could use some code like this, with our variable words representing an array of strings:

So here we take our array of words and remove the first word, ‘SA0228’. On the variable word we use gsub with a regular expression. We say: if the string includes any characters other than capital letters from A to Z, replace those outliers with “” (an empty string), which removes them.

This leaves us with just SA which we set equal to a variable initials.
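Here is a rough sketch of what that could look like. The helper name extract_initials is my own invention for illustration; the shift and gsub calls are standard Ruby.

```ruby
# Pull the initials out of the first word of a record.
# extract_initials is a hypothetical helper name used for illustration.
def extract_initials(words)
  word = words.shift         # removes and returns the first word, e.g. "SA0228"
  word.gsub(/[^A-Z]/, "")    # replace everything except capital A-Z with an empty string
end

words = ["SA0228", "PIS", "R1", "JS1", "WP4"]
initials = extract_initials(words)
puts initials   # prints SA
```

Note that shift changes the array itself, so after this runs words only holds the four remaining entries.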

When I write regex I give it a quick test over at Rubular.

Now we want to move through the remaining words and find any that match the letter R followed by one or two digits for years of experience. We could make an explicit little method like this:

Here we do a loop using the each method. This will let us cycle through each word in the words array one at a time. Each time through we take an individual word and we check it against the regular expression R\d.
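The loop described here might be sketched like this. The method name ruby_words is hypothetical, and this version only collects the matching words rather than doing anything further with them.

```ruby
# Collect every word that contains a capital R followed by a digit.
# ruby_words is a hypothetical name used for illustration.
def ruby_words(words)
  matches = []
  words.each do |word|
    # =~ gives back an index on a match and nil otherwise,
    # so it can sit directly in a conditional
    matches << word if word =~ /R\d/
  end
  matches
end

p ruby_words(["PIS", "R1", "JS1", "WP4"])   # prints ["R1"]
```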

Breaking down =~
When we evaluate the string “R1” against =~ /(R\d)/ the return value is 0. This corresponds to the first index where our regex finds a match, in this case the capital R at index 0. If we ran the expression “AR1” =~ /R\d/, the return value would be 1, corresponding to the R at index 1.

For clarification, our regex is checking for a capital R followed by any 0-9 value (\d is shorthand for any digit). There is another aspect of the syntax worth highlighting: when we use =~ we wrap our regex in / slashes, one at the beginning and one at the end. The parentheses create a capture group and do not change what matches; it is writing \d directly after R that requires the R to be immediately followed by a digit. A string like RP1 would not return an integer, as the regex would not match.

Another thing to highlight, if we ran JS1 =~ /(R\d)/ the return would be nil. In Ruby the only ‘falsy’ values are false and nil. 0 is a ‘truthy’ value. This is really handy because we can use our =~ regex statements in conditional clauses, like we do above. Pretty cool! For lots of information on true and false values in Ruby and other programming languages check this out.
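To see those return values side by side, here is a small sketch. The wrapper match_index is a made-up name that just makes the results easy to print.

```ruby
# match_index is a hypothetical wrapper so the return values are easy to inspect.
def match_index(string)
  string =~ /(R\d)/
end

p match_index("R1")    # prints 0
p match_index("AR1")   # prints 1
p match_index("RP1")   # prints nil
p match_index("JS1")   # prints nil

# 0 is truthy in Ruby, so a match at index 0 still passes a conditional.
puts "R1 matched" if match_index("R1")
```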

I use word[0] = ‘0’ as a benchmark-tested way of quickly replacing the character at index 0 with ‘0’, so “R1” becomes “01”. This then allows me to easily run .to_i to turn our string of digits into an integer, here 1.
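Putting the pieces together, a sketch of the whole lookup could look like this. The method name years_of_ruby is hypothetical. I have also assumed a stricter pattern than the plain R\d above: it is anchored to the whole word and allows one or two digits, so the character at index 0 is always the R being replaced.

```ruby
# years_of_ruby is a hypothetical name used for illustration.
# The anchored pattern is an assumption: it is stricter than a bare R\d.
def years_of_ruby(words)
  words.each do |word|
    if word =~ /\AR\d{1,2}\z/
      digits = word.dup     # copy so the caller's string is not modified
      digits[0] = "0"       # "R1" becomes "01"
      return digits.to_i    # "01".to_i is 1
    end
  end
  nil                       # no Ruby entry found
end

p years_of_ruby(["PIS", "R1", "JS1", "WP4"])   # prints 1
```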

From here we have the initials and the years experience in Ruby. You can imagine when tackling a huge dataset how regex and the =~ evaluator can be extremely powerful.

~ Finished ~

WordPress’ Event-Driven Architecture vs MVC

I was wondering recently which parts of the Ruby I have been learning could apply to WordPress. More specifically, we just started working with Rails, which is a framework for Ruby; the two together give us Ruby on Rails.

Ruby is our programming language. Rails is a framework that is neither a forward-facing web application nor a programming language. Rails is a library of Ruby code that allows us to expedite the application building process.

Ruby on Rails employs Model View Controller (MVC), which is a standard for dividing your code base according to its function. The Model is the logic. The View is what you see when visiting a webpage. The Controller is a go-between that requests logic when needed and sends the appropriate view.

WordPress on the other hand is employing an event-driven architecture. Which can be visualized like this:

[Image: diagram of WordPress’s event-driven architecture]

WordPress’s Event-Driven Architecture uses hooks to make changes at specified event triggers.

For instance, when a webpage is called it will hit the head tag. In WordPress you could write an action hook that says: send in this meta description when the head tag is called. (A meta description is that little snippet you see below a search result in Google.) WordPress also offers a second type of hook, called a filter, that is commonly used for manipulating data.
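WordPress hooks are written in PHP, but purely to illustrate the idea in the Ruby I have been learning, here is a toy hook registry. The class HookRegistry is entirely made up. Its method names borrow from WordPress’s add_action, do_action, add_filter and apply_filters functions, but this is a sketch of the concept and not how WordPress implements it.

```ruby
# A toy, hypothetical model of event-driven hooks (not WordPress's real code).
class HookRegistry
  def initialize
    @actions = Hash.new { |hash, key| hash[key] = [] }
    @filters = Hash.new { |hash, key| hash[key] = [] }
  end

  # Register a block to run when the named event fires.
  def add_action(event, &callback)
    @actions[event] << callback
  end

  # Fire the event: run every registered block and collect the results.
  def do_action(event)
    @actions[event].map { |callback| callback.call }
  end

  # Register a block that transforms a value for the named event.
  def add_filter(event, &callback)
    @filters[event] << callback
  end

  # Pass the value through every registered filter in order.
  def apply_filters(event, value)
    @filters[event].reduce(value) { |result, callback| callback.call(result) }
  end
end

hooks = HookRegistry.new
hooks.add_action(:head) { '<meta name="description" content="A blog about Ruby">' }
hooks.add_filter(:title) { |title| title.upcase }

puts hooks.do_action(:head)
puts hooks.apply_filters(:title, "hello")   # prints HELLO
```

The action adds something at a trigger point, while the filter takes existing data and hands back a changed version.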

WordPress has similar pieces to Rails, but a common theme in my research comparing the two was the sentiment not to get too hung up on the comparison, because the two do not really translate to one another.

WordPress uses MySQL as its database. The WordPress dashboard, which is WordPress, functions as the Application Layer. And templates are used to render views. There is a lot of commingling of functionality here that would drive an MVC programmer a little mad.

I found this quote by Tom McFarlin helpful in beginning to sort out the differences:

“Simply put, frameworks are not applications, foundations are applications.

Just because a web application can be built using WordPress does not make it a framework. It’s a foundation. It’s an application unto itself that can be extended into further applications.”

For the time being that is pretty much all I wanted to get down on paper (blog paper). A short and early exercise in comparing something familiar to me (WordPress) with the approach to web applications that I am learning (Ruby on Rails).

I leave you with a funny little middle ground I stumbled upon. There is a plugin out there that allows for an MVC framework within WordPress. As I looked through how to set up the plugin, the PHP used to install the basic framework is extremely similar to Ruby. Take a look:

~ Finished ~

The Programmer’s Return: Taijutsu

When I first began preparing for the Flatiron School, I learned quickly to avoid max volume on my headphones when watching instructional videos. Programming requires that you spend a great deal of time in Terminal, which is a powerful interface built into your Mac that allows you to manipulate code, create files, yatta yatch.

Terminal reacts to set commands. As a basic example, in a Ruby session (irb) the command puts “Hello” prints Hello. The way Terminal works is that you type in the command and nothing happens until you hit return (aka enter). Pressing enter evaluates whatever command you’ve set up and returns the appropriate response.

In more elaborate situations, the feedback from the computer lets you know that all the wonderful logic you just typed up does indeed equate to what you expected all along.

Additionally, consider this gif used on Learn.co to express a common misconception about programming:

Plainly, as a programmer you spend 90% of your time reading through code, following the logic. The remaining 10% of your time is actually spent manipulating it.

I think this is the unifying and underlying frustration that leads to The Programmer’s Return — the abnormal cubic force used to slam down the return key. If you watch any programming tutorial listen for the explosion that happens every time they press enter. It is the deafening result of programming mastery.

Many job interviewers will come equipped with tiny but dense rubber bands — requiring programmers to straddle their right pinky into the constraint in order to prove how effective they are at coding.

You’re Welcome!

~ Finished ~

A Beginner’s Beginner Guide to Scraping: Nokogiri Ninjutsu

In this post I will explain some of the fundamental concepts needed to utilize the Ruby gem Nokogiri. Gems in Ruby are comparable to plugins in WordPress. They are self-contained, little bundles of code you can drop into an application to gain additional functionality.

For more accurate reading please inject the phrase ‘To my understanding’ at the start of every sentence.

I followed Sam Callender’s wonderful tutorial, which you can find here. You can find the GitHub repo here.

Nokogiri allows you to scrape, or grab, information from a webpage or several webpages. Huh? Imagine you wanted to collect all your neighborhood restaurants to use in a web app you are building. Or if you wanted to grab all the content from a website and search through it to get a sense of what keywords they are targeting.

Nokogiri is often described as a tool to create an API when there is none.

In the tutorial referenced above we are collecting the first 100 Craigslist posts related to pets. We are taking each title and moving it into an Excel document. You could also move it into your website’s database. The full potential of this gem is still over my head, but I can tell it has the potential to allow for very creative and fun implementation.

So the url we are using for the tutorial is the same url you would see if you went to Craigslist and did a search for pets.

To understand the next step it is also important to realize you can right-click on any website and select view page source to reveal that site’s HTML. So as a simple example imagine this:

HTTParty Gem

The tutorial also utilizes another gem called HTTParty. This gem will allow us to grab the entirety of that source code (the HTML) and convert it into a string, aka some plain text we can manipulate.

So up until this point we have our dependencies (the gems we are requiring) and then a line of code that uses HTTParty to visit Craigslist and grab the page source code. So our Ruby code to start web scraping will look like this:

We set the variable page equal to the string we just collected. Using our example HTML to visualize what is happening: our page variable is pointing to something similar* to this:

*The pattern of back slashes is not accurate. The point being you receive all the original code bundled together into plain text.

Nokogiri Object

The next step is to convert the string we are storing in our page variable into a Nokogiri object. You will see below that our page variable is being passed as an argument. The result of the Nokogiri operation is assigned to a new variable, parse_page. So at this point our updated code is this:

If you run this process in terminal you will see that parse_page returns the original html, but as an object that we can operate on with Ruby. Which is cool and worth celebrating:

Parse

As of late I have been using parse as my go-to verb for taking any sort of action over a collection of code. The average person might use this realization and time to Google the word parse; I prefer to share my dilemma.

Moving on, the next step is to parse our object — which it seems is just iterating over it and putting each desired bit of information into an array. So first let’s take a look at our updated code:

So we have an empty array that we will eventually store everything in on line 10. Then on lines 12 – 15 we have our iteration method.

The big thing to note here is that Nokogiri allows us to use the .css method on our new object which we are pointing to with the variable parse_page. So in the code the series of .css methods that take css selectors as arguments would be this:

.css('.content').css('.row').css('.hdrlnk')

This is some crazy, but slightly informed guessing, but I think what is going on here is equivalent to saying this in css:

.content .row .hdrlnk

These CSS selectors are specific to Craigslist.com. As an example of what is happening, we will use our html from earlier:

If we wanted to grab the phrases “Blah blah blah” and “Yatta Yatch”, our CSS path could be:

#content p

Typically we use CSS to style a webpage, but using the language’s syntax we can select specific sections of HTML.

The CSS selectors used in the tutorial grab all the titles from Craigslist. The next bit of code converts our links into plain text, using Nokogiri’s .text method. We see this on line 13.

Lastly we push each string of text, each headline in this case, into our newly created empty array pets_array.

In this particular tutorial the array is then sent into a .csv file — which can be opened in Excel. The updated and final version of the code would be this:
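A sketch of the CSV step using Ruby’s csv library. The helper name write_titles is made up, and writing one title per row is my assumption about the layout rather than something taken from the tutorial.

```ruby
require 'csv'

# write_titles is a hypothetical helper; the file name is the caller's choice.
def write_titles(path, titles)
  CSV.open(path, 'w') do |csv|
    titles.each { |title| csv << [title] }   # one title per row
  end
end

# Usage: write_titles('pets.csv', pets_array)
```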

A csv would be a handy way to look at data or manipulate it.

Once the data is in a Ruby array, though, we could do anything with it. Send it off to our database and build an application that tracks all things Craigslist pets.

Nokogiri Name

Nokogiri is a type of Japanese saw with teeth that face inward… you see… saw, scraping it all comes together.


This post could have possibly been more informative if I was not more or less responsible for giving the world the Kyary gif(t) found above. You are welcome.

Resources & Thanks

Distilled
Engine Yard
Ruby B-word Book
My Brain

~ Finished ~