Project Notes

#225 Parsing with Ruby

Techniques for working with XML, HTML4, and HTML5 from Ruby using gems such as Nokogiri.

Notes

Although XML and HTML are just text and can be handled as string data, it is better to use a gem that takes care of concerns such as validation, encoding, and semantic searching.

Nokogiri

Nokogiri (鋸) is the most popular tool for working with XML, HTML4, and HTML5 from Ruby.

HTML with Nokogiri

Nokogiri::HTML provides the entry point for HTML parsing:

require 'nokogiri'
require 'open-uri' # provides URI.open for fetching remote documents
source = 'https://nokogiri.org/tutorials/installing_nokogiri.html'
doc = Nokogiri::HTML(URI.open(source))

See nokogiri_html_test.rb for test examples of HTML parsing, including:

  • find by CSS: doc.css('nav ul.menu li a', 'article h2')
  • find by XPath: doc.xpath('//nav//ul//li/a', '//article//h2')
  • find with search, which accepts both CSS and XPath: doc.search('nav ul.menu li a', '//article//h2')

$ ./nokogiri_html_test.rb
Run options: --seed 3738

# Running:

...

Finished in 1.076714s, 2.7863 runs/s, 5.5725 assertions/s.

3 runs, 6 assertions, 0 failures, 0 errors, 0 skips

XML with Nokogiri

Nokogiri::XML provides the entry point for XML parsing:

require 'nokogiri'
require 'pathname'
source_file = Pathname.new(File.dirname(__FILE__)).join('data', 'planes.xml')
doc = Nokogiri::XML(File.open(source_file))

See nokogiri_xml_test.rb for test examples of XML parsing, including:

  • find by element name: doc.xpath("//model")
  • find by element attribute value: doc.xpath(%(//seller[@phone="555-222-3333"]))
  • find elements that have a given child: doc.xpath(%(//ad[price]))

$ ./nokogiri_xml_test.rb
Run options: --seed 60440

# Running:

...

Finished in 0.002498s, 1200.9608 runs/s, 2802.2419 assertions/s.

3 runs, 7 assertions, 0 failures, 0 errors, 0 skips

About LCK#225 (tags: ruby, XML, HTML, nokogiri)

This page is a web-friendly rendering of my project notes shared in the LittleCodingKata GitHub repository.

About LittleCodingKata

LittleCodingKata is my collection of programming exercises, research and code toys broadly spanning things that relate to programming and software development (languages, frameworks and tools).

These range from the trivial to the complex and serious. Many are inspired by existing work and I'll note credits and references where applicable. The focus is quite scattered, as I variously work on things new and important in the moment, or go back to revisit things from the past.

This is primarily a personal collection for my own edification and learning, but anyone who stumbles by is welcome to borrow, steal or reference the work here. And if you spot errors or issues I'd really appreciate some feedback - create an issue, send me an email or even send a pull-request.