In this post I will show three ways to parse HTML documents and extract data from them, which is useful for scraping websites and for analyzing or converting offline documents.

  • Nokogiri gem -

    The nokogiri gem is a popular Ruby HTML/XML parser built on libxml2 (a software library for parsing XML documents). To parse HTML with nokogiri, use the Nokogiri::HTML method:


 require 'nokogiri'

 # input is a String (or an IO object) containing the HTML to parse
 document = Nokogiri::HTML(input)

  • Oga gem -

    The oga gem is a Ruby XML/HTML parser with a small...

On Mac OS X, if you have upgraded to Ruby 2.1.5 (or even 2.1.2), there is a good chance you will run into problems installing the "nokogiri" gem. You might see errors like:

ERROR: Failed to build gem native extension.

The following steps resolve these issues on Mac OS X:

1) brew install libxml2

2) Read the complete error output on the console when you run "bundle" or "bundle install". You will notice that it contains useful tips for fixing the problem on your machine, for example:

for Building Nokogiri...

