In this post I will show three ways to parse HTML documents and extract data from them, which is useful for scraping websites, analyzing content, converting offline documents, and so on.

  • Nokogiri gem -

    The nokogiri gem is a popular Ruby HTML/XML parser built on libxml2 (a software library for parsing XML documents). Parse HTML with nokogiri using the Nokogiri::HTML method, passing the markup as a string or IO object:

```
require 'nokogiri'
document = Nokogiri::HTML(input)
```

  • Oga gem -

    The oga gem is a Ruby XML/HTML parser with a small native extension. Parse HTML with oga using the Oga.parse_html method:

```
require 'oga'
document = Oga.parse_html(input)
```

You might want to use oga if you have difficulties installing nokogiri.

  • Nokogumbo gem -

    The nokogumbo gem is a wrapper for gumbo, Google’s pure-C HTML5 parser. Parse HTML with nokogumbo using the Nokogiri::HTML5 method:

```
require 'nokogumbo'
document = Nokogiri::HTML5(input)
```

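Nokogumbo returns nokogiri data structures, so switching to it from nokogiri is relatively straightforward. Since nokogiri 1.12, Nokogiri::HTML5 ships with nokogiri itself on CRuby, so the separate nokogumbo gem is only needed with older nokogiri versions.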
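
Because both parsers hand back nokogiri documents, the choice between them can sit behind one small helper. Here is a minimal sketch of that idea, not a definitive implementation: parse_document is a hypothetical name of my own, and the guard around require is only there so the snippet degrades to nil when nokogiri is not installed.

```ruby
begin
  require 'nokogiri'
rescue LoadError
  # nokogiri is not installed; parse_document returns nil in that case.
end

# Hypothetical helper: use the HTML5 parser when it is loaded,
# otherwise fall back to the libxml2-based parser.
def parse_document(html)
  return nil unless defined?(Nokogiri)

  if defined?(Nokogiri::HTML5)
    Nokogiri::HTML5(html)
  else
    Nokogiri::HTML(html)
  end
end

document = parse_document('<html><head><title>Hello</title></head><body><p>Hi</p></body></html>')
puts document.at('title').text if document
```

Either branch returns an object that answers the same search calls, so the rest of the code does not need to know which parser ran.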

Parsing HTML fragments

You can also parse fragments of HTML instead of complete documents. Use the fragment class method with nokogiri and nokogumbo; with oga, use Oga.parse_html as before:

  • Nokogiri

```
require 'nokogiri'
fragment = Nokogiri::HTML.fragment('<span>Hello World</span>')
```

  • Nokogumbo

```
require 'nokogumbo'
fragment = Nokogiri::HTML5.fragment('<span>Hello World</span>')
```

  • Oga

```
require 'oga'
fragment = Oga.parse_html('<span>Hello World</span>')
```
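
To see what the fragment call changes, it helps to compare its serialized output with a full-document parse of the same markup. This is a rough sketch using nokogiri only; the helper names fragment_html and document_html are made up for the illustration, and both return nil if nokogiri is not installed.

```ruby
begin
  require 'nokogiri'
rescue LoadError
  # nokogiri is not installed; the helpers below return nil in that case.
end

# Illustrative helper: parse as a fragment and serialize it again.
def fragment_html(markup)
  return nil unless defined?(Nokogiri)

  Nokogiri::HTML.fragment(markup).to_html
end

# Illustrative helper: parse as a complete document and serialize it again.
def document_html(markup)
  return nil unless defined?(Nokogiri)

  Nokogiri::HTML(markup).to_html
end

markup = '<span>Hello World</span>'
puts fragment_html(markup) # the span on its own
puts document_html(markup) # the span wrapped in html and body elements
```

The fragment keeps the markup as it was given, while the document parser adds the surrounding html and body elements.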

Searching by CSS selector

The easiest way to identify specific elements in a document is to search for them by CSS selector.

Nokogiri provides the #search method (which accepts CSS selectors as well as XPath expressions), and oga provides the #css method. For example, here’s how you would search for all anchor elements within a document:

  • Nokogiri and Nokogumbo

```
document.search('a')
```

  • Oga

```
document.css('a')
```
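
The result of such a search is a collection you can iterate over. As a small sketch with nokogiri (link_texts is a hypothetical helper name, and it returns nil when nokogiri is not installed), here is one way to collect the text of every anchor:

```ruby
begin
  require 'nokogiri'
rescue LoadError
  # nokogiri is not installed; link_texts returns nil in that case.
end

# Illustrative helper: the text of every anchor element in the markup.
def link_texts(html)
  return nil unless defined?(Nokogiri)

  document = Nokogiri::HTML(html)
  # #search returns a NodeSet, which can be iterated like an array.
  document.search('a').map(&:text)
end

html = '<p><a href="/one">One</a> and <a href="/two">Two</a></p>'
p link_texts(html)
```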

To search for a single element, nokogiri provides the #at method and oga provides the #at_css method; both return the first match. For example, searching for the title element:

  • Nokogiri and Nokogumbo

```
document.at('title')
```

  • Oga

```
document.at_css('title')
```
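
One thing to keep in mind with single-element lookups is that nokogiri's #at returns nil when nothing matches, so the result should be checked before it is used. A minimal sketch, assuming nokogiri (page_title is a hypothetical helper name, and it also returns nil when nokogiri is not installed):

```ruby
begin
  require 'nokogiri'
rescue LoadError
  # nokogiri is not installed; page_title returns nil in that case.
end

# Illustrative helper: the document title, or nil if there is none.
def page_title(html)
  return nil unless defined?(Nokogiri)

  title = Nokogiri::HTML(html).at('title')
  # &. skips the call to #text when #at found no matching element.
  title&.text
end

puts page_title('<html><head><title>Example</title></head><body></body></html>')
p page_title('<p>No title here</p>')
```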

There are several other techniques, such as traversing every element and extracting element text, attribute values, attribute hashes, or tabular data, which I might cover in the next post.

We mostly use nokogiri to parse HTML and extract data from it, but out of curiosity I looked into these two alternatives, nokogumbo and oga, and thought I would share them with you all.

Hope this helps.
