Changes

25,210 bytes added ,  07:19, 16 July 2007
</pre>
=== Filtering Step 4. From HTML to ConTeXt ===

Here is where most of the fun happens. I will try to illustrate the HTML to ConTeXt translation for the various markup elements one by one.

==== Removing unwanted markup ====

Not all of the markup in the HTML is needed, so we remove the unwanted elements first. The following is based on the markup used in Informl.

<pre>
# Function: scrape_page.rb

def scrape_the_page(pagePath,oFile,hFile)

  items_to_remove = [
    "#menus",          #menus notice
    "div.markedup",
    "div.navigation",
    "head",            #table of contents
    "hr"
  ]

  doc=Hpricot(open(pagePath))
  @article = (doc/"#container").each do |content|
    #remove unnecessary content and edit links
    items_to_remove.each { |x| (content/x).remove }
  end
</pre>

==== Simple replacements ====

For many elements we need do nothing more than translate the HTML element into the corresponding ConTeXt element and fill it with the "inner html". Elements such as h1 and strong are typical examples.

<pre>
  # How to replace various syntactic elements using Hpricot

  # replace p/*/b element with \bf
  (@article/"p/*/b").each do |pb|
    pb.swap("{\\bf #{pb.inner_html}}")
  end

  # replace p/b element with \bf
  (@article/"p/b").each do |pb|
    pb.swap("{\\bf #{pb.inner_html}}")
  end

  # replace strong element with \bf
  (@article/"strong").each do |ps|
    ps.swap("{\\bf #{ps.inner_html}}")
  end

  # replace h1 element with section
  (@article/"h1").each do |h1|
    h1.swap("\\section{#{h1.inner_html}}")
  end

  # replace h2 element with subsection
  (@article/"h2").each do |h2|
    h2.swap("\\subsection{#{h2.inner_html}}")
  end

  # replace <pre><code> by equivalent command in context
  (@article/"pre").each do |pre|
    pre.swap("\\startcode \n #{pre.at("code").inner_html} \n \\stopcode")
  end
</pre>

==== Figures ====

Let's deal with figure references first.

<pre>
  # when we encounter a reference to a figure inside the html
  # we replace it with a ConTeXt reference
  (@article/"a").each do |a|
    a.swap("\\in[#{a.inner_html}]")
  end
</pre>

I have used the "alt" attribute of the HTML element "img" to carry some ConTeXt figure setup directives such as width and position. To retrieve these, I split the comma-separated entries of "alt" into an array. The "alt" attribute is also used to carry an optional "keyword" for the image, which is useful for referencing figures.

<pre>
  # replace <p><img> by equivalent command in context
  (@article/"p/img").each do |img|
    img_attrs=img.attributes['alt'].split(",")
</pre>

<pre>
    # ConTeXt can figure out the best image format. So we remove the file extension for images
    # I have to take care of file names that have a "." embedded in them, so I reverse
    # the order of the string before operating on it. After I filter, I reverse it again.
    img_src=img.attributes['src'].reverse.sub(/\w+\./,"").reverse
</pre>

<pre>
    # see if position of figure is indicated
    img_pos="force"
    img_attrs.each do |arr|
      img_pos=arr.gsub("position=","") if arr.match("position=")
    end
    img_attrs.delete("position=#{img_pos}") unless img_pos=="force"
</pre>

<pre>
    # see if the array img_attrs contains a referral keyword
    if img_attrs.first.match(/\w+[=]\w+/)
      img_id=" "
    else
      img_id=img_attrs.first
      img_attrs.delete_at(0)
    end
</pre>

<pre>
    # note: as written, only images with the default "force" position are converted
    if img_pos=="force"
      if img.attributes['title']
        img.swap("
        \\placefigure\n
        [#{img_pos}][#{img_id}] \n
        {#{img.attributes['title']}} \n
        {\\externalfigure[#{img_src}][#{img_attrs.join(",")}]} \n
        ")
      else
        img.swap("
        \\placefigure\n
        [#{img_pos}] \n
        {none} \n
        {\\externalfigure[#{img_src}][#{img_attrs.join(",")}]}
        ")
      end
    end
  end # end of converting inside (@article/"p/img")
</pre>

==== Tables ====

<pre>
  # Tables: placing them
  # replace <table> by equivalent command in context
  (@article/"table").each do |tab|
    if tab.at("caption")
      tab.swap("
      \\placetable[split]{#{tab.at("caption").inner_html}}\n
      {\\bTABLE \n
      #{tab.inner_html}
      \\eTABLE}
      ")
    else
      tab.swap("
      \\placetable[split]{}\n
      {\\bTABLE \n
      #{tab.inner_html}
      \\eTABLE} \n
      ")
    end
  end

  # Tables: remove the caption
  (@article/"caption").each do |cap|
    cap.swap("\n")
  end
</pre>

==== The Rest ====

<pre>
  # Now we transfer the syntactically altered html to a string object
  # and manipulate that object further

  newdoc=@article.inner_html

  # remove empty space in the beginning
  newdoc.gsub!(/^\s+/,"")

  # remove all elements we don't need.
  newdoc.gsub!(/^<div.*/,"")
  newdoc.gsub!(/^<\/div.*/,"")
  newdoc.gsub!(/^<form.*/,"")
  newdoc.gsub!(/^<\/form.*/,"")
  newdoc.gsub!(/<p>/,"\n")
  newdoc.gsub!(/<\/p>/,"\n")
  newdoc.gsub!(/<u>/,"")
  newdoc.gsub!(/<\/u>/,"")
  newdoc.gsub!(/<ul>/,"\\startitemize[1]")
  newdoc.gsub!(/<\/ul>/,"\\stopitemize")
  newdoc.gsub!(/<ol>/,"\\startitemize[n]")
  newdoc.gsub!(/<\/ol>/,"\\stopitemize")
  newdoc.gsub!(/<li>/,"\\item ")
  newdoc.gsub!(/<\/li>/,"\n")
  newdoc.gsub!("_","\\_")
  newdoc.gsub!(/<table>/,"\\bTABLE \n")
  newdoc.gsub!(/<\/table>/,"\\eTABLE \n")
  newdoc.gsub!(/<tr>/,"\\bTR ")
  newdoc.gsub!(/<\/tr>/,"\\eTR ")
  newdoc.gsub!(/<td>/,"\\bTD ")
  newdoc.gsub!(/<\/td>/,"\\eTD ")
  newdoc.gsub!(/<th>/,"\\bTH ")
  newdoc.gsub!(/<\/th>/,"\\eTH ")
  newdoc.gsub!(/<center>/,"")
  newdoc.gsub!(/<\/center>/,"")
  newdoc.gsub!(/<em>/,"{\\em ")
  newdoc.gsub!(/<\/em>/,"}")
  newdoc.gsub!("^","")
  newdoc.gsub!("\%","\\%")
  newdoc.gsub!("&amp;","&")
  newdoc.gsub!("&",'\\\&')
  newdoc.gsub!("$",'\\$')
  newdoc.gsub!(/<tbody>/,"\\bTABLEbody \n")
  newdoc.gsub!(/<\/tbody>/,"\\eTABLEbody \n")

  # Context does not mind "_" in figures and does not recognize \_,
  # so I have to catch these and replace \_ with _

  # First catch
  filter=/\/AnnRep07\/Figures\/(\w+\/)*(\w+\\_)*/
  if newdoc[filter]
    newdoc.gsub!(filter) { |fString|
      fString.gsub("\\_","_")
    }
  end

  # Second catch
  filter2=/\/AnnRep07\/Figures\/(\w+\/)*\w+[-.]\w+\\_\w+/
  if newdoc[filter2]
    newdoc.gsub!(filter2) { |fString|
      fString.gsub("\\_","_")
    }
  end

  # Third catch; remove \_ inside []
  filter3=/\[\w+\\_\w+\]/
  if newdoc[filter3]
    newdoc.gsub!(filter3) { |fString|
      puts fString
      fString.gsub("\\_","_")
    }
  end

  # remove the comment tag, which we used to embed context commands
  newdoc.gsub!("<!--","")
  newdoc.gsub!("-->","")

  # add full path to the images
  newdoc.gsub!("\/AnnRep07\/Figures\/","~\/AnnRep07\/Figures\/")

  # drop any remaining self-closing tags
  newdoc.gsub!(/<\w+\s*\/>/,"")

  #puts newdoc
  # open file for output
  #outfil="#{oFile}.tex"
  #`rm #{outfil}`
  #fil=File.new(outfil,"a")
  #puts "Writing #{oFile}"
  oFile.write newdoc
end
</pre>
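
The "alt" handling from the Figures section can also be tried out on plain strings, without Hpricot. The following is only a minimal sketch under some assumptions: the helper name <code>figure_to_context</code> and its arguments are hypothetical and not part of the original script, it strips the extension with an anchored regular expression instead of the reverse trick, and unlike the code above it emits a <code>\placefigure</code> for any position value, not just "force".

```ruby
# Hypothetical helper (not in the original script): turns the src, alt and
# title values of an <img> into a \placefigure call, following the steps
# described in the Figures section.
def figure_to_context(src, alt, title = nil)
  attrs = alt.split(",").map(&:strip)

  # Drop only the final extension, so names with an embedded "." survive.
  stem = src.sub(/\.\w+\z/, "")

  # Take "position=..." out of the list; default to "force" as in the article.
  pos_attr = attrs.find { |a| a.start_with?("position=") }
  position = pos_attr ? pos_attr.sub("position=", "") : "force"
  attrs.delete(pos_attr) if pos_attr

  # A first entry without "=" is treated as the reference keyword.
  key = nil
  key = attrs.shift if attrs.first && !attrs.first.include?("=")

  caption = title || "none"
  ref = key ? "[#{key}]" : ""
  "\\placefigure[#{position}]#{ref}{#{caption}}" \
    "{\\externalfigure[#{stem}][#{attrs.join(',')}]}"
end

# Example call with made-up values:
puts figure_to_context("/AnnRep07/Figures/map.v2.png",
                       "fig:map,width=8cm,position=here",
                       "Study area")
```

Keeping this logic in a function that works on strings makes it easy to check the keyword and position rules before wiring it back into the <code>(@article/"p/img").each</code> loop.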