I’ve been churning along at a fairly fast pace through the Ruby Object Oriented Track at Community-Powered Bootcamp. I finally got to a project that doesn’t feel like something from a textbook, one where I had to pick the pieces myself and build it from scratch. It was a web scraping project where I had to pull data from a website of my choosing and then provide a command line interface to interact with it.

I decided to pull from Hacker News and have it output five headlines at a time based on their rank; you just give it the rank number you want to start listing from. I might eventually go back and rework it as my Ruby knowledge improves, but for now it’s good to see an almost real-world use of programming knowledge. It seems so many courses focus on Tic Tac Toe or Rock Paper Scissors games rather than letting you make something that could actually be used at some point.

It uses Nokogiri and OpenURI. The first thing I had to do was fetch the URL so Nokogiri could break the page down into a format that I could then parse.

So this was the initial opening of the web address:

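require "nokogiri"
require "open-uri"
# Fetch the front page and parse it into a Nokogiri document.
def get_page
  Nokogiri::HTML(URI.open("https://news.ycombinator.com/"))
end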

Then I had to find the spot where the text is stored. Surprisingly enough, they still use HTML tables:

# Select the story rows: each <tr class="athing"> inside table.itemlist.
def scrape_homepage
  get_page.css("table.itemlist tr.athing")
end

Then I used an iterator to assign the various parts. I have everything separated out into classes, but to simplify it for the sake of explanation, this is similar to what I did to split out the rank, title, URL, and website:

scrape_homepage.each do |article|
  # Pull each field out of the row with its own CSS selector.
  rank = article.css("span.rank").text
  title = article.css("a.storylink").text
  url = article.css("td.title a")[0]["href"]
  site = article.css("span.sitestr").text
end
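
Since the real project keeps this data in classes, here is a rough sketch of one way the four fields could be bundled into objects. The `Headline` struct and the `scraped` sample data are hypothetical names made up for illustration; they are not the classes from my repository.

```ruby
# Hypothetical value object holding the four scraped fields.
Headline = Struct.new(:rank, :title, :url, :site, keyword_init: true)

# Made-up sample data standing in for what the scraper would extract.
scraped = [
  { rank: "1.", title: "Example story", url: "https://example.com/story", site: "example.com" },
  { rank: "2.", title: "Another story", url: "https://example.org/post", site: "example.org" }
]

# Turn each hash of fields into a Headline object.
headlines = scraped.map { |attrs| Headline.new(**attrs) }

headlines.each { |h| puts "#{h.rank} #{h.title} (#{h.site})" }
```

Keeping the fields on an object is what lets the output code call things like `headline.rank` and `headline.title` later on.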

Now to output it (also simplified a bit):

def list_headlines(start_number)
  @headlines[start_number-1, 5].each do |headline|
    puts "\n\n#{headline.rank} \"#{headline.title}\" from the site: #{headline.site}"
    puts "Full Web Address: #{headline.url}"
  end
end
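
One thing worth knowing about `@headlines[start_number-1, 5]`: when the start index is past the end of the array, Ruby returns nil rather than an empty array, and calling `each` on nil raises an error. Below is a small sketch of a guard for that case. The `headlines_from` helper and the `sample` array are hypothetical and only meant to show the slicing behavior.

```ruby
# Hypothetical helper: returns up to `count` items starting at a 1-based rank,
# or an empty array when the rank is out of range.
def headlines_from(headlines, start_number, count = 5)
  return [] if start_number < 1          # rank 0 or negative would wrap around

  headlines[start_number - 1, count] || []  # nil (past the end) becomes []
end

sample = (1..12).map { |n| "Headline #{n}" }

p headlines_from(sample, 1)    # the first five
p headlines_from(sample, 10)   # only three are left
p headlines_from(sample, 50)   # out of range, so an empty array
```

With a guard like this, an out-of-range rank just prints nothing instead of crashing the command line interface.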

Application in action:

Walkthrough of the Hacker News Reader

If you want to check it out, the code is in its repository on GitHub.