Changes

HTML and ConTeXt (view source)

Revision as of 07:29, 16 July 2007

34,091 bytes removed, 07:29, 16 July 2007

no edit summary

[[Image:Wiki_prev.jpg]]

== Translating HTML into ConTeXt using Ruby ==

[[Using Hpricot to translate HTML tags to ConText]]

The next step is to retrieve the HTML pages created in the step above. Here I have used the Ruby library 'open-uri' to retrieve the web page and another library, 'hpricot', to edit these pages and translate HTML markup into ConTeXt markup.

=== Step 1. Open the remote page ===
<pre>
#!/usr/bin/ruby
# scan_page.rb = Retrieves the html page of interest from the server,
#   navigates to links within the main page and constructs a
#   ConTeXt document

require 'rubygems'
require 'open-uri'     # the open-uri library
require 'hpricot'      # the hpricot library
require 'scrape_page'  # user-defined function to filter html into ConTeXt

# scans the home page and lists
# all the directories and subdirectories
doc=Hpricot(open("http://ipa.dd.re.ss/AnnRep07"))
</pre>

=== Step 2. Setting up the ConTeXt document ===
<pre>
mainfil="annrep.tex"  # open a file to output ConTeXt document
`rm #{mainfil}`
fil=File.new(mainfil,"a")

# Add some opening directives and include style files
fil.write "\\input context_styles \n"  # this file contains the styling options for my ConTeXt document
fil.write "\\starttext \n"
fil.write "\\leftaligned{\\BigFontOne Contents} \n"
fil.write "\\vfill \n"
fil.write "{ \\switchtobodyfont[10pt] "
fil.write "\\startcolumns[n=2,balance=no,rule=off,option=background,frame=off,background=color,backgroundcolor=blue:1] \n"
fil.write "\\placecontent \n"
fil.write "\\stopcolumns \n"
fil.write "}"
</pre>

=== Step 3. Clicking chapters and section links ===

In this example, we created new pages for chapters and sections so that each part of the document could be authored by a different person. In Informl, links to new pages are marked with the CSS class name "existingWikiWord", as shown in the following figure.

[[Image:Wiki_prev2.jpg]]

<pre>
<p>
<a class="existingWikiWord"
href="http://localhost:3010/AnnRep07/pages/APCC+Research+and+Development+Projects">
APCC Research and Development Projects
</a></p>
</pre>

Knowing this, I have used the following 'hpricot' code to follow the chapter and section links and retrieve their contents.

<pre>
chapters= (doc/"p/a.existingWikiWord")

# we need to navigate one more level into the web page
# let us discover the links for that
chapters.each do |ch|
  chap_link = ch.attributes['href']
  # using inner_html we can create subdirectories
  chap_name = ch.inner_html.gsub(/\s*/,"")
  chap_name_org = ch.inner_html
  # We create chapter directories
  system("mkdir -p #{chap_name}")
  fil.write "\\input #{chap_name} \n"
  chapFil="#{chap_name}.tex"
  `rm #{chapFil}`
  cFil=File.new(chapFil,"a")
  cFil.write "\\chapter{ #{chap_name_org} } \n"
</pre>

<pre>
  # We navigate to sections now
  doc2=Hpricot(open(chap_link))
  sections= (doc2/"p/a.existingWikiWord")
  sections.each do |sc|
    sec_link = sc.attributes['href']
    sec_name = sc.inner_html.gsub(/\s*/,"")
    secFil="#{chap_name}/#{sec_name}.tex"
    `rm #{secFil}`
    sFil=File.new(secFil,"a")
    sechFil="#{chap_name}/#{sec_name}.html"
    `rm #{sechFil}`
    shFil=File.new(sechFil,"a")
</pre>

After navigating to the sections (h1 elements in HTML), we retrieve their contents and send them to the Ruby function in "scrape_page.rb" for filtering.

<pre>
    # scrape_the_page(sec_link,"#{chap_name}/#{sec_name}")
    scrape_the_page(sec_link,sFil,shFil)
    cFil.write "\\input #{chap_name}/#{sec_name} \n"
  end
end
fil.write "\\stoptext \n"
</pre>

=== Step 4. From HTML to ConTeXt ===

Here is where most of the fun happens. I will try to illustrate the HTML to ConTeXt translation for the various markup elements one by one.

==== Removing unwanted markup ====

Not all of the HTML markup is needed, so we first remove the unwanted elements. The following is based on the markup used in Informl.

<pre>
# Function: scrape_page.rb
def scrape_the_page(pagePath,oFile,hFile)
items_to_remove = [
  "#menus",         #menus notice
  "div.markedup",
  "div.navigation",
  "head",           #table of contents
  "hr"
]
doc=Hpricot(open(pagePath))
@article = (doc/"#container").each do |content|
  #remove unnecessary content and edit links
  items_to_remove.each { |x| (content/x).remove }
end
</pre>

==== Simple replacements ====

For many elements we need do nothing more than translate the HTML element into the corresponding ConTeXt element and fill it with the "inner html". Elements such as h1 and strong are typical examples.

<pre>
# How to replace various syntactic elements using Hpricot
# replace p/*/b element with \bf
(@article/"p/*/b").each do |pb|
  pb.swap("{\\bf #{pb.inner_html}}")
end
# replace p/b element with \bf
(@article/"p/b").each do |pb|
  pb.swap("{\\bf #{pb.inner_html}}")
end
# replace strong element with \bf
(@article/"strong").each do |ps|
  ps.swap("{\\bf #{ps.inner_html}}")
end
# replace h1 element with section
(@article/"h1").each do |h1|
  h1.swap("\\section{#{h1.inner_html}}")
end
# replace h2 element with subsection
(@article/"h2").each do |h2|
  h2.swap("\\subsection{#{h2.inner_html}}")
end
# replace <pre><code> by equivalent command in context
(@article/"pre").each do |pre|
  pre.swap("\\startcode \n #{pre.at("code").inner_html} \n
\\stopcode")
end
</pre>

==== Figures ====

<pre>
# Let's deal with figure references first
# when we encounter a reference to a figure inside the html
# we replace it with a ConTeXt reference
(@article/"a").each do |a|
  a.swap("\\in[#{a.inner_html}]")
end
</pre>

I have used the "alt" attribute of the HTML element "img" to carry some ConTeXt figure setup directives such as width and position. To retrieve these, I split the "alt" string into an array. The "alt" attribute can also carry an optional "keyword" for the image, which is useful for referencing figures.

<pre>
# replace <p><img> by equivalent command in context
(@article/"p/img").each do |img|
  img_attrs=img.attributes['alt'].split(",")
</pre>

<pre>
  # ConTeXt can figure out the best image format. So we remove the file extension for images
  # I have to take care of file names that have a "." embedded in them, so I reverse
  # the order of the string before operating on it. After I filter, I reverse it again.
  img_src=img.attributes['src'].reverse.sub(/\w+\./,"").reverse
</pre>

<pre>
  # see if position of figure is indicated
  img_pos="force"
  img_attrs.each do |arr|
    img_pos=arr.gsub("position=","") if arr.match("position=")
  end
  img_attrs.delete("position=#{img_pos}") unless img_pos=="force"
</pre>

<pre>
  # see if the array img_attrs contains a referral keyword
  if img_attrs.first.match(/\w+[=]\w+/)
    img_id=" "
  else
    img_id=img_attrs.first
    img_attrs.delete_at(0)
  end
</pre>

<pre>
  if img_pos=="force"
    if img.attributes['title']
      img.swap("
\\placefigure\n
[#{img_pos}][#{img_id}] \n
{#{img.attributes['title']}} \n
{\\externalfigure[#{img_src}][#{img_attrs.join(",")}]} \n
")
    else
      img.swap("
\\placefigure\n
[#{img_pos}] \n
{none} \n
{\\externalfigure[#{img_src}][#{img_attrs.join(",")}]}
")
    end
  end
end # end of converting inside (@article/"p/img")
</pre>

==== Tables ====

<pre>
# Tables : placing them
# replace <table> by equivalent command in context
(@article/"table").each do |tab|
  if tab.at("caption")
    tab.swap("
\\placetable[split]{#{tab.at("caption").inner_html}}\n
{\\bTABLE \n
#{tab.inner_html}
\\eTABLE}
")
  else
    tab.swap("
\\placetable[split]{}\n
{\\bTABLE \n
#{tab.inner_html}
\\eTABLE} \n
")
  end
end
# Tables: remove the caption
(@article/"caption").each do |cap|
  cap.swap("\n")
end
</pre>

==== The Rest ====
<pre>
# Now we transfer the syntactically altered html to a string object
# and manipulate that object further
newdoc=@article.inner_html
# remove empty space in the beginning
newdoc.gsub!(/^\s+/,"")
# remove all elements we don't need.
newdoc.gsub!(/^<div.*/,"")
newdoc.gsub!(/^<\/div.*/,"")
newdoc.gsub!(/^<form.*/,"")
newdoc.gsub!(/^<\/form.*/,"")
newdoc.gsub!(/<p>/,"\n")
newdoc.gsub!(/<\/p>/,"\n")
newdoc.gsub!(/<u>/,"")
newdoc.gsub!(/<\/u>/,"")
newdoc.gsub!(/<ul>/,"\\startitemize[1]")
newdoc.gsub!(/<\/ul>/,"\\stopitemize")
newdoc.gsub!(/<ol>/,"\\startitemize[n]")
newdoc.gsub!(/<\/ol>/,"\\stopitemize")
newdoc.gsub!(/<li>/,"\\item ")
newdoc.gsub!(/<\/li>/,"\n")
newdoc.gsub!("_","\\_")
newdoc.gsub!(/<table>/,"\\bTABLE \n")
newdoc.gsub!(/<\/table>/,"\\eTABLE \n")
newdoc.gsub!(/<tr>/,"\\bTR ")
newdoc.gsub!(/<\/tr>/,"\\eTR ")
newdoc.gsub!(/<td>/,"\\bTD ")
newdoc.gsub!(/<\/td>/,"\\eTD ")
newdoc.gsub!(/<th>/,"\\bTH ")
newdoc.gsub!(/<\/th>/,"\\eTH ")
newdoc.gsub!(/<center>/,"")
newdoc.gsub!(/<\/center>/,"")
newdoc.gsub!(/<em>/,"{\\em ")
newdoc.gsub!(/<\/em>/,"}")
newdoc.gsub!("^","")
newdoc.gsub!("\%","\\%")
newdoc.gsub!("&amp;","&")
newdoc.gsub!("&",'\\\&')
newdoc.gsub!("$",'\\$')
newdoc.gsub!(/<tbody>/,"\\bTABLEbody \n")
newdoc.gsub!(/<\/tbody>/,"\\eTABLEbody \n")
# ConTeXt does not mind "_" in figure names and does not recognize \_ there,
# so I have to catch these and replace \_ with _
# First catch
filter=/\/AnnRep07\/Figures\/(\w+\/)*(\w+\\_)*/
if newdoc[filter]
  newdoc.gsub!(filter) { |fString|
    fString.gsub("\\_","_")
  }
end
# Second catch
filter2=/\/AnnRep07\/Figures\/(\w+\/)*\w+[-.]\w+\\_\w+/
if newdoc[filter2]
  newdoc.gsub!(filter2) { |fString|
    fString.gsub("\\_","_") }
end
# Third catch; remove \_ inside []
filter3=/\[\w+\\_\w+\]/
if newdoc[filter3]
  newdoc.gsub!(filter3) { |fString|
    puts fString
    fString.gsub("\\_","_") }
end
# remove the comment tags, which we used to embed context commands
newdoc.gsub!("<!--","")
newdoc.gsub!("-->","")
# add full path to the images
newdoc.gsub!("\/AnnRep07\/Figures\/","~\/AnnRep07\/Figures\/")
newdoc.gsub!(/<\w+\s*\/>/,"")
#puts newdoc
# open file for output
#outfil="#{oFile}.tex"
#`rm #{outfil}`
#fil=File.new(outfil,"a")
#puts "Writing #{oFile}"
oFile.write newdoc
end
</pre>
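The handling of the "alt" and "src" attributes described under Figures can be tried out without Hpricot. The following is only a sketch: plain strings stand in for the attribute values, and the helper names <code>strip_extension</code> and <code>parse_alt</code> are hypothetical, not part of the original script. Unlike the original, it always drops the "position=" entry, even when the position is "force".

```ruby
# Sketch only: plain strings replace Hpricot attributes; helper names are hypothetical.

# Remove the last file extension while keeping earlier dots in the name.
# Reversing the string lets a non-global sub hit the final ".ext" first.
def strip_extension(src)
  src.reverse.sub(/\w+\./, "").reverse
end

# Split an "alt" string into [keyword, position, remaining setup options].
# The keyword is optional; the position defaults to "force".
def parse_alt(alt)
  attrs = alt.split(",")
  position = "force"
  attrs.each do |a|
    position = a.sub("position=", "") if a.start_with?("position=")
  end
  attrs.delete("position=#{position}")
  keyword = nil
  # A first entry without "key=value" form is treated as the reference keyword.
  keyword = attrs.shift unless attrs.empty? || attrs.first =~ /\w+=\w+/
  [keyword, position, attrs]
end

puts strip_extension("/AnnRep07/Figures/rain.v2.png")
p parse_alt("rainfall,width=8cm,position=here")
```

The first call keeps the embedded dot in "rain.v2" and drops only ".png"; the second separates the keyword "rainfall" from the position and the remaining setup options.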
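The escaping steps in "The Rest" interact: underscores are escaped everywhere and then restored where ConTeXt expects a raw name. The sketch below shows that idea on a plain string. The function name <code>escape_for_context</code> is hypothetical, and it restores underscores inside any square-bracket argument, which is a simplification of the three path-specific filters in the original script.

```ruby
# Sketch only: a simplified stand-in for the escaping in scrape_the_page.
# The block form of gsub! avoids backslash interpretation in replacements.
def escape_for_context(text)
  out = text.dup
  out.gsub!("&amp;", "&")          # decode the HTML entity first
  out.gsub!("%") { "\\%" }         # escape TeX special characters
  out.gsub!("&") { "\\&" }
  out.gsub!("$") { "\\$" }
  out.gsub!("_") { "\\_" }
  # Assumption: anything in [...] is a file name or reference key,
  # where ConTeXt wants a plain underscore.
  out.gsub!(/\[[^\]]*\]/) { |arg| arg.gsub("\\_", "_") }
  out
end

puts escape_for_context("50% of R&amp;D_cost \\externalfigure[rain_fall]")
```

Decoding "&amp;amp;" before escaping "&" matters: in the opposite order the entity itself would be turned into "\&amp;".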

Saji

