|
|
Line 39: |
Line 39: |
| [[Image:Wiki_prev.jpg]] | | [[Image:Wiki_prev.jpg]] |
| | | |
− | == Translating HTML into ConTeXt using Ruby ==
| + | [[Translating HTML into ConTeXt using Ruby]] : Using Hpricot to translate HTML tags to ConTeXt |
− | | |
− | The next step is to retrieve the HTML pages created in the step above. Here I have used the Ruby library 'open-uri' to
| |
− | retrieve the web page and another library, 'hpricot', to edit these pages and translate HTML markup into ConTeXt markup.
| |
− | | |
− | === Step 1. Open the remote page ===
| |
− | <pre>
| |
− | | |
− | #scan_page.rb = Retrieves the html page of interest from the server,
| |
− | # navigates to links within the main page and construct a
| |
− | # context document
| |
− |
| |
− | #!/usr/bin/ruby
| |
− |
| |
− | require 'rubygems'
| |
− | require 'open-uri' # the open-uri library
| |
− | require 'hpricot' # the hpricot library
| |
− | require 'scrape_page' # user-defined function to filter html into ConTeXt
| |
− |
| |
− | # scans the home page and lists
| |
− | # all the directories and subdirectories
| |
− |
| |
− | doc=Hpricot(open("http://ipa.dd.re.ss/AnnRep07"))
| |
− | | |
− | </pre>
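On Ruby 3.0 and later, the bare `open(url)` call no longer fetches URLs; open-uri offers `URI.open` for that purpose. The following is a minimal sketch, assuming a hypothetical helper name `fetch_page` (not part of the original script) that also accepts a local path so the rest of the pipeline can be tried offline:

```ruby
require "open-uri"

# Hypothetical helper (an assumption, not from the original script):
# returns the page source as a String. Remote pages go through URI.open;
# anything else is treated as a local file path.
def fetch_page(location)
  if location.match?(%r{\Ahttps?://})
    URI.open(location, &:read)
  else
    File.read(location)
  end
end
```

The returned string can be handed to the HTML parser wherever the script passes the result of `open(...)`.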
| |
− | | |
− | === Step 2. Setting up the ConTeXt document ===
| |
− | <pre>
| |
− |
| |
− | mainfil="annrep.tex" # open a file to output ConTeXt document
| |
− | `rm #{mainfil}`
| |
− | fil=File.new(mainfil,"a")
| |
− | | |
− | # Add some opening directives and include style files
| |
− |
| |
− | fil.write "\\input context_styles \n" # this file contains the styling options for my Context document
| |
− | fil.write "\\starttext \n"
| |
− | fil.write "\\leftaligned{\\BigFontOne Contents} \n"
| |
− | fil.write "\\vfill \n"
| |
− | fil.write "{ \\switchtobodyfont[10pt] "
| |
− | fil.write "\\startcolumns[n=2,balance=no,rule=off,option=background,frame=off,background=color,backgroundcolor=blue:1] \n"
| |
− | fil.write "\\placecontent \n"
| |
− | fil.write "\\stopcolumns \n"
| |
− | fil.write "}"
| |
− | | |
− | </pre>
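The `rm` followed by an append-mode open can probably be collapsed into a single open in "w" mode, which truncates an existing file. A sketch under that assumption, with a hypothetical helper `start_document` and the preamble kept in one heredoc:

```ruby
# The opening directives from Step 2, kept verbatim in a raw heredoc so that
# backslashes need no doubling.
PREAMBLE = <<~'TEX'
  \input context_styles
  \starttext
  \leftaligned{\BigFontOne Contents}
  \vfill
  { \switchtobodyfont[10pt]
  \startcolumns[n=2,balance=no,rule=off,option=background,frame=off,background=color,backgroundcolor=blue:1]
  \placecontent
  \stopcolumns
  }
TEX

# Hypothetical helper (an assumption): opens the main file in "w" mode, which
# discards any previous content, writes the preamble, and returns the open
# File so the caller can keep writing \input lines to it.
def start_document(path)
  fil = File.new(path, "w")
  fil.write(PREAMBLE)
  fil
end
```

The caller remains responsible for writing `\stoptext` and closing the file once all chapters have been added.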
| |
− | | |
− | === Step 3. Clicking chapters and section links ===
| |
− | | |
− | In this example, we created new pages for chapters and sections so that each part of the document could
| |
− | be authored by a different person. In Informl, links to such pages carry the CSS class name "existingWikiWord",
| |
− | as shown in the following figure.
| |
− | | |
− | [[Image:Wiki_prev2.jpg]].
| |
− | | |
− | <pre>
| |
− | <p>
| |
− | <a class="existingWikiWord"
| |
− | href="http://localhost:3010/AnnRep07/pages/APCC+Research+and+Development+Projects">
| |
− | APCC Research and Development Projects
| |
− | </a>
| |
− | </p>
| |
− | | |
− | </pre>
| |
− | | |
− | Knowing this, I have used the following 'hpricot' code to click on chapter and section links to retrieve
| |
− | their contents.
| |
− | | |
− | <pre>
| |
− | chapters= (doc/"p/a.existingWikiWord")
| |
− |
| |
− | # we need to navigate one more level into the web page
| |
− | # let us discover the links for that
| |
− |
| |
− | chapters.each do |ch|
| |
− | chap_link = ch.attributes['href']
| |
− | # using inner_html we can create subdirectories
| |
− |
| |
− | chap_name = ch.inner_html.gsub(/\s*/,"")
| |
− | chap_name_org = ch.inner_html
| |
− |
| |
− | # We create chapter directories
| |
− | system("mkdir -p #{chap_name}")
| |
− | fil.write "\\input #{chap_name} \n"
| |
− | chapFil="#{chap_name}.tex"
| |
− | `rm #{chapFil}`
| |
− | cFil=File.new(chapFil,"a")
| |
− | cFil.write "\\chapter{ #{chap_name_org} } \n"
| |
− | </pre>
| |
− |
| |
− | <pre>
| |
− | # We navigate to sections now
| |
− | doc2=Hpricot(open(chap_link))
| |
− | sections= (doc2/"p/a.existingWikiWord")
| |
− | sections.each do |sc|
| |
− | sec_link = sc.attributes['href']
| |
− | sec_name = sc.inner_html.gsub(/\s*/,"")
| |
− | | |
− | secFil="#{chap_name}/#{sec_name}.tex"
| |
− | `rm #{secFil}`
| |
− | sFil=File.new(secFil,"a")
| |
− | sechFil="#{chap_name}/#{sec_name}.html"
| |
− | `rm #{sechFil}`
| |
− | shFil=File.new(sechFil,"a")
| |
− |
| |
− | </pre>
| |
− | | |
− | After navigating to the sections (h1 elements in HTML), we retrieve their contents
| |
− | and send them to the Ruby function scrape_the_page (defined in scrape_page.rb) for filtering.
| |
− | | |
− | <pre>
| |
− | # scrape_the_page(sec_link,"#{chap_name}/#{sec_name}")
| |
− | scrape_the_page(sec_link,sFil,shFil)
| |
− | cFil.write "\\input #{chap_name}/#{sec_name} \n"
| |
− | end
| |
− | end
| |
− | fil.write "\\stoptext \n"
| |
− | </pre>
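The file and directory names in this step are derived from the link text by removing all whitespace. The sketch below isolates that naming rule and uses Ruby's `fileutils` instead of shelling out to `mkdir` and `rm`; the helper names `chapter_paths` and `open_chapter` and the `root` argument are assumptions made for illustration:

```ruby
require "fileutils"

# Hypothetical helper: derives the names used for one chapter from its
# link text by stripping all whitespace.
def chapter_paths(link_text)
  name = link_text.gsub(/\s+/, "")
  { dir: name, tex: "#{name}.tex" }
end

# Hypothetical helper: creates the chapter directory and a fresh chapter
# file under +root+, writes the \chapter line, and returns the open File.
def open_chapter(root, link_text)
  paths = chapter_paths(link_text)
  FileUtils.mkdir_p(File.join(root, paths[:dir]))
  cfil = File.new(File.join(root, paths[:tex]), "w")
  cfil.write "\\chapter{ #{link_text} } \n"
  cfil
end
```

Because the chapter title doubles as a file name, two chapters whose titles differ only in spacing would collide; that limitation is inherited from the naming rule itself.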
| |
− | | |
− | === Step 4. From HTML to ConTeXt ===
| |
− | | |
− | Here is where most of the fun happens. I will try to illustrate the HTML to ConTeXt translation for the
| |
− | various markup elements one by one.
| |
− | | |
− | ==== Removing unwanted markup ====
| |
− | | |
− | Not all of the HTML markup is needed, so we remove the unwanted elements first. The following is based on the
| |
− | markup used in Informl.
| |
− | <pre>
| |
− | | |
− | # Function: scrape_page.rb
| |
− |
| |
− | def scrape_the_page(pagePath,oFile,hFile)
| |
− | items_to_remove = [
| |
− | "#menus", #menus notice
| |
− | "div.markedup",
| |
− | "div.navigation",
| |
− | "head", #table of contents
| |
− | "hr"
| |
− | ]
| |
− |
| |
− | doc=Hpricot(open(pagePath))
| |
− | @article = (doc/"#container").each do |content|
| |
− | #remove unnecessary content and edit links
| |
− | items_to_remove.each { |x| (content/x).remove }
| |
− | end
| |
− |
| |
− | </pre>
| |
− | | |
− | ==== Simple replacements ====
| |
− | | |
− | For many elements we need do nothing more than translate the HTML element into the
| |
− | corresponding ConTeXt element and fill it with the "inner html". Elements such
| |
− | as h1 and strong are typical examples.
| |
− | | |
− | <pre>
| |
− |
| |
− | # How to replace various syntactic elements using Hpricot
| |
− | # replace p/*/b element with \bf
| |
− | (@article/"p/*/b").each do |pb|
| |
− | pb.swap("{\\bf #{pb.inner_html}}")
| |
− | end
| |
− |
| |
− | # replace p/b element with \bf
| |
− | (@article/"p/b").each do |pb|
| |
− | pb.swap("{\\bf #{pb.inner_html}}")
| |
− | end
| |
− |
| |
− | # replace strong element with \bf
| |
− | (@article/"strong").each do |ps|
| |
− | ps.swap("{\\bf #{ps.inner_html}}")
| |
− | end
| |
− |
| |
− | # replace h1 element with section
| |
− | (@article/"h1").each do |h1|
| |
− | h1.swap("\\section{#{h1.inner_html}}")
| |
− | end
| |
− | | |
− | # replace h2 element with subsection
| |
− | (@article/"h2").each do |h2|
| |
− | h2.swap("\\subsection{#{h2.inner_html}}")
| |
− | end
| |
− | | |
− | # replace <pre><code> by equivalent command in context
| |
− | (@article/"pre").each do |pre|
| |
− | pre.swap("\\startcode \n #{pre.at("code").inner_html} \n
| |
− | \\stopcode")
| |
− | end
| |
− | | |
− | </pre>
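The one-to-one replacements above all follow the same pattern, so they can be summarised as a lookup table. The sketch below is not the Hpricot approach used on this page: it works on a plain string with regular expressions and only handles attribute-free, non-nested tags. The names `SIMPLE_TAGS` and `replace_simple_tags` are assumptions:

```ruby
# Tag name => [text placed before the content, text placed after it].
SIMPLE_TAGS = {
  "b"      => ["{\\bf ", "}"],
  "strong" => ["{\\bf ", "}"],
  "h1"     => ["\\section{", "}"],
  "h2"     => ["\\subsection{", "}"],
}.freeze

# Hypothetical helper: rewrites <tag>content</tag> pairs listed in
# SIMPLE_TAGS. The block form of gsub is used so that backslashes in the
# replacement text are inserted literally.
def replace_simple_tags(html)
  SIMPLE_TAGS.reduce(html) do |text, (tag, (before, after))|
    text.gsub(%r{<#{tag}>(.*?)</#{tag}>}m) { "#{before}#{$1}#{after}" }
  end
end
```

A real parser remains the safer choice for anything with attributes or nesting, which is presumably why the page works on the parsed document instead.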
| |
− | ==== Figures ====
| |
− |
| |
− | <pre>
| |
− | | |
− | # Let's deal with figure references first
| |
− | # when we encounter a reference to a figure inside the html
| |
− | # we replace it with a ConTeXt reference
| |
− |
| |
− | (@article/"a").each do |a|
| |
− | a.swap("\\in[#{a.inner_html}]")
| |
− | end
| |
− | | |
− | </pre>
| |
− | | |
− | I have used the "alt" attribute of the HTML element "img" to carry some ConTeXt figure
| |
− | setup directives such as width and position. To retrieve these, I split the comma-separated
| |
− | entries of "alt" into an array. The "alt" attribute is also used to carry an optional
| |
− | "keyword" for the image, which is useful for referencing figures.
| |
− | | |
− | <pre>
| |
− | | |
− | # replace <p><img> by equivalent command in context
| |
− | (@article/"p/img").each do |img|
| |
− |
| |
− | img_attrs=img.attributes['alt'].split(",")
| |
− | | |
− | </pre>
| |
− | | |
− | <pre>
| |
− |
| |
− | # ConTeXt can figure out the best image format. So we remove the file extension for images
| |
− | # I have to take care of file names that have a "." embedded in them, so I reverse
| |
− | # the order of the string before operating on it. After I filter, I reverse it again.
| |
− |
| |
− | img_src=img.attributes['src'].reverse.sub(/\w+\./,"").reverse
| |
− | | |
− | </pre>
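The reverse, substitute, reverse trick removes only the last extension. An anchored regular expression appears to give the same result in one step; the helper name `strip_extension` below is an assumption:

```ruby
# Hypothetical helper: drops a trailing ".ext" from an image path, leaving
# earlier dots in the file name untouched. \z anchors the match to the end
# of the string.
def strip_extension(src)
  src.sub(/\.\w+\z/, "")
end
```

Unlike the reversed-string version, this form cannot match a dot that occurs earlier in the path when the file itself has no extension.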
| |
− | | |
− | <pre>
| |
− |
| |
− | # see if position of figure is indicated
| |
− | img_pos="force"
| |
− | img_attrs.each do |arr|
| |
− | img_pos=arr.gsub("position=","") if arr.match("position=")
| |
− | end
| |
− | img_attrs.delete("position=#{img_pos}") unless img_pos=="force"
| |
− | | |
− | </pre>
| |
− | | |
− | <pre>
| |
− |
| |
− | # see if the array img_attrs contains a reference keyword
| |
− | if img_attrs.first.match(/\w+[=]\w+/)
| |
− | img_id=" "
| |
− | else
| |
− | img_id=img_attrs.first
| |
− | img_attrs.delete_at(0)
| |
− | end
| |
− | | |
− | </pre>
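The handling of the "alt" attribute can be tried in isolation on a plain string. The following sketch assumes the comma-separated convention described above; the function name `parse_alt`, the returned hash layout, and the sample values in the comments are assumptions:

```ruby
# Hypothetical helper: splits an alt string such as
# "fig:rain,width=10cm,position=here" into its parts.
#   :key      optional reference keyword (first entry without "name=value")
#   :position value of "position=", or "force" when none is given
#   :options  remaining entries, passed on to \externalfigure
def parse_alt(alt)
  attrs = alt.split(",").map(&:strip)

  position = "force"
  attrs.reject! do |entry|
    if entry.start_with?("position=")
      position = entry.sub("position=", "")
      true # drop this entry from the option list
    end
  end

  key = nil
  key = attrs.shift if attrs.first && attrs.first !~ /\A\w+=/

  { key: key, position: position, options: attrs }
end
```

Keeping the parsing separate from the `swap` call makes it easier to check how a given "alt" string will be interpreted before a whole report is regenerated.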
| |
− | | |
− | <pre>
| |
− |
| |
− | if img_pos=="force"
| |
− | if img.attributes['title']
| |
− | img.swap("
| |
− | \\placefigure\n
| |
− | [#{img_pos}][#{img_id}] \n
| |
− | {#{img.attributes['title']}} \n
| |
− | {\\externalfigure[#{img_src}][#{img_attrs.join(",")}]} \n
| |
− | ")
| |
− | else
| |
− | img.swap("
| |
− | \\placefigure\n
| |
− | [#{img_pos}] \n
| |
− | {none} \n
| |
− | {\\externalfigure[#{img_src}][#{img_attrs.join(",")}]}
| |
− | ")
| |
− | end
| |
− | end
| |
− |
| |
− | end # end of converting inside (@article/"p/img")
| |
− | | |
− | </pre>
| |
− | | |
− | ==== Tables ====
| |
− | | |
− | <pre>
| |
− | | |
− | # Tables : placing them
| |
− | # replace <table> by equivalent command in context
| |
− | (@article/"table").each do |tab|
| |
− | if tab.at("caption")
| |
− | tab.swap("
| |
− | \\placetable[split]{#{tab.at("caption").inner_html}}\n
| |
− | {\\bTABLE \n
| |
− | #{tab.inner_html}
| |
− | \\eTABLE}
| |
− | ")
| |
− | else
| |
− | tab.swap("
| |
− | \\placetable[split]{}\n
| |
− | {\\bTABLE \n
| |
− | #{tab.inner_html}
| |
− | \\eTABLE} \n
| |
− | ")
| |
− | end
| |
− | end
| |
− |
| |
− | # Tables: remove the caption
| |
− | (@article/"caption").each do |cap|
| |
− | cap.swap("\n")
| |
− | end
| |
− | | |
− | </pre>
| |
− | | |
− | ==== The Rest ====
| |
− | <pre>
| |
− | | |
− | # Now we transfer the syntactically altered html to a string Object
| |
− | # and manipulate that object further
| |
− | | |
− |
| |
− | newdoc=@article.inner_html
| |
− |
| |
− | # remove empty space in the beginning
| |
− | newdoc.gsub!(/^\s+/,"")
| |
− |
| |
− | # remove all elements we don't need.
| |
− | newdoc.gsub!(/^<div.*/,"")
| |
− | newdoc.gsub!(/^<\/div.*/,"")
| |
− | newdoc.gsub!(/^<form.*/,"")
| |
− | newdoc.gsub!(/^<\/form.*/,"")
| |
− | newdoc.gsub!(/<p>/,"\n")
| |
− | newdoc.gsub!(/<\/p>/,"\n")
| |
− | newdoc.gsub!(/<u>/,"")
| |
− | newdoc.gsub!(/<\/u>/,"")
| |
− | newdoc.gsub!(/<ul>/,"\\startitemize[1]")
| |
− | newdoc.gsub!(/<\/ul>/,"\\stopitemize")
| |
− | newdoc.gsub!(/<ol>/,"\\startitemize[n]")
| |
− | newdoc.gsub!(/<\/ol>/,"\\stopitemize")
| |
− | newdoc.gsub!(/<li>/,"\\item ")
| |
− | newdoc.gsub!(/<\/li>/,"\n")
| |
− | newdoc.gsub!("_","\\_")
| |
− | newdoc.gsub!(/<table>/,"\\bTABLE \n")
| |
− | newdoc.gsub!(/<\/table>/,"\\eTABLE \n")
| |
− | newdoc.gsub!(/<tr>/,"\\bTR ")
| |
− | newdoc.gsub!(/<\/tr>/,"\\eTR ")
| |
− | newdoc.gsub!(/<td>/,"\\bTD ")
| |
− | newdoc.gsub!(/<\/td>/,"\\eTD ")
| |
− | newdoc.gsub!(/<th>/,"\\bTH ")
| |
− | newdoc.gsub!(/<\/th>/,"\\eTH ")
| |
− | newdoc.gsub!(/<center>/,"")
| |
− | newdoc.gsub!(/<\/center>/,"")
| |
− | newdoc.gsub!(/<em>/,"{\\em ")
| |
− | newdoc.gsub!(/<\/em>/,"}")
| |
− | newdoc.gsub!("^","")
| |
− | newdoc.gsub!("\%","\\%")
| |
− | newdoc.gsub!("&amp;","&")
| |
− | newdoc.gsub!("&",'\\\&')
| |
− | newdoc.gsub!("$",'\\$')
| |
− | newdoc.gsub!(/<tbody>/,"\\bTABLEbody \n")
| |
− | newdoc.gsub!(/<\/tbody>/,"\\eTABLEbody \n")
| |
− |
| |
− | # ConTeXt accepts "_" in figure paths and does not recognize \_ there,
| |
− | # so I have to catch these and replace \_ with _
| |
− |
| |
− | # First catch
| |
− | filter=/\/AnnRep07\/Figures\/(\w+\/)*(\w+\\_)*/
| |
− |
| |
− | if newdoc[filter]
| |
− | newdoc.gsub!(filter) { |fString|
| |
− | fString.gsub("\\_","_")
| |
− | }
| |
− | end
| |
− |
| |
− | # Second catch
| |
− | filter2=/\/AnnRep07\/Figures\/(\w+\/)*\w+[-.]\w+\\_\w+/
| |
− |
| |
− | if newdoc[filter2]
| |
− | newdoc.gsub!(filter2) { |fString|
| |
− | fString.gsub("\\_","_") }
| |
− | end
| |
− |
| |
− | # Third catch; remove \_ inside []
| |
− | filter3=/\[\w+\\_\w+\]/
| |
− |
| |
− | if newdoc[filter3]
| |
− | newdoc.gsub!(filter3) { |fString|
| |
− | puts fString
| |
− | fString.gsub("\\_","_") }
| |
− | end
| |
− |
| |
− |
| |
− | # remove the comment tag, which we used to embed context commands
| |
− | newdoc.gsub!("<!--","")
| |
− | newdoc.gsub!("-->","")
| |
− | # add full path to the images
| |
− | newdoc.gsub!("\/AnnRep07\/Figures\/","~\/AnnRep07\/Figures\/")
| |
− |
| |
− | newdoc.gsub!(/<\w+\s*\/>/,"")
| |
− |
| |
− | #puts newdoc
| |
− | # open file for output
| |
− | #outfil="#{oFile}.tex"
| |
− | #`rm #{outfil}`
| |
− |
| |
− | #fil=File.new(outfil,"a")
| |
− | #puts "Writing #{oFile}"
| |
− | oFile.write newdoc
| |
− |
| |
− | end
| |
− | </pre>
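The character escaping in the last step is sensitive to how backslashes are interpreted in `gsub` replacement strings. One way to sidestep that is the block form of `gsub` with a lookup table, sketched below; `TEX_SPECIALS` and `escape_tex` are assumed names, and the sketch covers only the four characters escaped above:

```ruby
# Characters that need a backslash in ConTeXt running text.
TEX_SPECIALS = {
  "%" => "\\%",
  "&" => "\\&",
  "$" => "\\$",
  "_" => "\\_",
}.freeze

# Hypothetical helper: decodes the HTML entity for "&" and then escapes
# the special characters in a single pass. With the block form, the
# returned string is inserted literally, so no extra backslash doubling
# is required.
def escape_tex(text)
  text.gsub("&amp;", "&").gsub(/[%&$_]/) { |ch| TEX_SPECIALS[ch] }
end
```

As on this page, underscores inside figure paths and reference keys would still have to be restored afterwards, or the escaping applied only to running text.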
| |