A look at the code behind the blog

2020-11-20

This is going to be very meta, I'm afraid. In the first post of this blog I said that I wasn't sure why I was doing this. Two months later, I am still not entirely sure of my motivation (though, being fairly obsessive, I find it lets me put all my thoughts somewhere instead of rehashing them constantly in my mind, so it does help with that), but I know that one of the small reasons was to see whether I could write a very simple piece of code that produces a functional blog.

As you may have noticed reading this, this website is fairly minimalistic. The first reason is of course that I am not a designer, but the second and main reason is that my biggest issue with Web 2.0 is that it is bloated. If you look at the source code of most pages, it is way bigger than it should be, between all the unnecessary JavaScript calls, the opaque, automatically built CSS (typical of blogging environments such as WordPress), Google Analytics and whatnot. It annoys me primarily because all that extraneous content has a carbon footprint, but also because it obfuscates the web. So I have been making simple, easily maintainable, functional pages using pure HTML and CSS (and a very tiny bit of JavaScript, when MathJax or a syntax highlighter is needed). But blogs are a tad more complex to maintain (because of the index, the categories, etc.).

So here is what I came up with.

First, the template for a blog page (which you can see in full here) has two parts: the first one is the navigation part common to the rest of the website (the thing you see on the left) and the rest is the blogpost itself:

<div class="main">
<p class="blog-title"></p>
<p class="blog-date"></p>
<p class="blog-content">
  <!-- <pre><code class="r">
  </code></pre> -->
</p>
<hr/>
  <div class="footer">
    <a href="2020-11-13.html">&lt; Previous entry</a> | <a href="index.html">Back to Index</a> <!--| <a href="">Next entry &gt;</a>-->
  </div>
</div>

Basically the idea is just to fill in the title, date and content. I do not touch the rest at first. You will notice that the link to the previously published page is already entered (the script does that). In the <head> part of the HTML code, the only things I modify are the following meta tags:

<meta property="og:title" content="" />
<meta property="og:image" content="" />
<meta name="category" content="">

The og: tags define the "overview" of the webpage when it is shared on Facebook, Twitter, Reddit, etc. The category tag is used by my code to decide under which category the page will be listed.

And so, the code itself (in R, naturally), to process the blog:

library(XML) #Only uses package XML
process <- function(html_file){
# This function will retrieve the content of each blogpost as a dataframe
# containing the url, the date, the title and the actual content.
  h <-htmlParse(html_file,encoding="utf-8") #Parse the file
  tit <- xpathSApply(h,"//p[@class='blog-title']",xmlValue) #Extract the title
  dat <- xpathSApply(h,"//p[@class='blog-date']",xmlValue) #The date
  desc <- xpathSApply(h,"//p[@class='blog-content']",xmlValue) #the content
  data.frame(url=html_file,title=tit,date=as.Date(dat),content=desc)
}
categories <- function(html_file){
#This one grabs the categories each blogpost was entered under
  h <-htmlParse(html_file,encoding="utf-8")
  x <- as.data.frame(do.call(rbind,xpathApply(h,"//meta[@name='category']",xmlAttrs)))[['content']]
  `if`(length(x),x,NA) #If no category, returns NA
}

setwd("plannapus.github.io/blog") #The path to the blog folder
entries <- dir(pattern="[0-9]\\.html$") #Takes all the blogpost files (names ending in a digit followed by .html)

all <- do.call(rbind,lapply(entries,process)) #Apply the process function to each file and turn into a single dataframe
catg <- sapply(entries,categories) #Grab categories
catg <- catg[order(all$date,decreasing=TRUE)] #Reorder both per date (in reverse chronological order)
all <- all[order(all$date,decreasing=TRUE),]
cats <- names(sort(table(unlist(catg)),decreasing=TRUE)) #Retrieve unique categories and sort by number of posts
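
To make the reordering step concrete, here is a small self-contained sketch with made-up posts (the URLs, titles and categories are invented for illustration, and `posts`, `post_cats` and `ranked` are just stand-in names for `all`, `catg` and `cats`). It only uses base R:

```r
# Toy stand-in for the data frame built by process(); all values are invented
posts <- data.frame(url   = c("2020-09-18.html", "2020-11-13.html", "2020-10-02.html"),
                    title = c("First post", "Third post", "Second post"),
                    date  = as.Date(c("2020-09-18", "2020-11-13", "2020-10-02")),
                    stringsAsFactors = FALSE)
# Toy stand-in for the output of sapply(entries, categories): one element per post
post_cats <- list("meta", c("R", "meta"), "meta")
names(post_cats) <- posts$url

# The same permutation is applied to both objects so that they stay aligned
o <- order(posts$date, decreasing = TRUE)
post_cats <- post_cats[o]
posts <- posts[o, ]

# Unique categories, ranked by number of posts
ranked <- names(sort(table(unlist(post_cats)), decreasing = TRUE))
print(posts$url)
print(ranked)
```

The important detail is that the categories have to be reordered before the data frame is, or at least with an ordering computed once beforehand; otherwise the two objects no longer refer to the same posts.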

From that point on it's a lot of metaprogramming, where I have HTML strings that I fill with the appropriate content. First the index page. This is really the most important part, the one that makes the blog readable. The only things changing in that page are the table containing all the post titles and dates, and the list of categories:

index <- readLines("index.html",encoding="utf-8") #Reads in the index page
#The following line creates each entry of the table
j <- sprintf("\t\t\t<tr><td class=\"date\">%s</td><td class=\"title\"><a href=\"%s\">%s</a></td></tr>",all$date, all$url, all$title)
#Then we replace the former table with the new one:
index <- c(index[1:grep("<table",index)],j,index[grep("</table",index):length(index)])
#And the same with the categories:
index[grep(">Categories:",index)] <- sprintf("\t\t\t<div class=\"footer\">Categories: %s<br/>",
                                             paste(sprintf("<a href=\"categories/%1$s.html\">%1$s</a>",cats),collapse=" - "))
#We then print it back to the file (overwriting the previous content):
cat(index,file="index.html",sep="\n")
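
The splice around the `<table` and `</table` lines is the part that is easiest to get wrong, so here is a minimal sketch of it on a made-up five-line page held in memory (the page content and the `page` and `new_rows` names are only for illustration). It assumes, as the code above does, that the page contains exactly one table and that its opening and closing tags each sit on their own line:

```r
# A tiny invented page, one element per line, as readLines() would return it
page <- c("<body>",
          "<table>",
          "<tr><td>old row</td></tr>",
          "</table>",
          "</body>")
# Two invented entries, formatted like the rows of the real index
new_rows <- sprintf("<tr><td class=\"date\">%s</td><td class=\"title\"><a href=\"%s\">%s</a></td></tr>",
                    c("2020-11-20", "2020-11-13"),
                    c("2020-11-20.html", "2020-11-13.html"),
                    c("Newer post", "Older post"))
# Keep everything up to and including <table>, then the new rows,
# then everything from </table> onwards: the old rows are dropped
page <- c(page[1:grep("<table", page)],
          new_rows,
          page[grep("</table", page):length(page)])
cat(page, sep = "\n")
```

Because the whole table body is regenerated on every run, nothing has to be tracked between runs: the HTML files themselves are the only source of truth.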

Then for each category, a specific page:

for(i in seq_along(cats)){
  #Here each page will be built on the same template, which is similar to the index page, with an additional title
  index <- readLines("categories/template.html",encoding="utf-8") #Reads in the template
  w <-sapply(catg,function(x)cats[i]%in%x)
  sub <- all[w,] #Pick the subset having the category currently processed
  #Makes the table, as previously:
  j <- sprintf("\t\t\t<tr><td class=\"date\">%s</td><td class=\"title\"><a href=\"../%s\">%s</a></td></tr>",sub$date, sub$url, sub$title)
  index <- c(index[1:grep("<table",index)],j,index[grep("</table",index):length(index)])
  #Adds the category name as title:
  index[grep("<h3>",index)] <- sprintf("<div class=\"footer\"><h3>Category: %s</h3></div>",cats[i])
  #Save in an html page named as the category for simplicity:
  cat(index,file=sprintf("categories/%s.html",cats[i]),sep="\n")
}
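
The `%in%` test inside `sapply` is what lets a post belong to several categories at once. A short sketch with invented data (none of these file names or categories come from the actual blog, and `post_cats` and `in_r` are illustrative names):

```r
# One element per post; a post can have several categories, or NA if it has none
post_cats <- list("2020-11-20.html" = c("R", "meta"),
                  "2020-11-13.html" = "music",
                  "2020-11-06.html" = NA,
                  "2020-10-30.html" = "R")
# TRUE for every post that lists "R" among its categories
in_r <- sapply(post_cats, function(x) "R" %in% x)
print(names(post_cats)[in_r])
# The file name of the category page is built directly from the category name
print(sprintf("categories/%s.html", "R"))
```

Since the category name ends up both in a file name and in a link, it is probably safest to stick to names without spaces or slashes.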

Then there is a section creating the RSS feed: it is very similar to the previous bit. You can check it out in the repository but I don't see the point of explaining it here. After that, there are a couple of lines of code to replace the link to the last blogpost in the template, followed by another few lines to put the link to the new post in the penultimate one:

template <- readLines("template.html") #Reads in template
template <- gsub(as.character(all$date[2]),as.character(all$date[1]),template) #Replace the link to the second-to-last post by the last one
cat(template,file="template.html",sep="\n") #Prints it back

last_html <- entries[grepl(as.character(all$date[2]),entries)] #Finds the second-to-last post
last <- readLines(last_html) #Reads it in
w <- grep("Back to Index",last) #Finds the line with the links
last[w] <- gsub("<!--\\| <a href=\"\">Next entry &gt;</a>-->",
                sprintf("| <a href=\"%s.html\">Next entry &gt;</a>",all$date[1]),
                last[w]) #And adds the new one (the pattern must match the commented-out link of the template exactly).
cat(last,sep="\n",file=last_html) #Prints it back
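
To see what that last substitution does, here is a sketch on a single footer line modelled on the template shown earlier (the dates and the `footer` and `new_date` names are just examples). The pattern is an ordinary regular expression, so the `|` has to be escaped, and it has to match the commented-out link character for character:

```r
# A footer line as it appears in a freshly written post (invented dates)
footer <- "<a href=\"2020-11-06.html\">&lt; Previous entry</a> | <a href=\"index.html\">Back to Index</a> <!--| <a href=\"\">Next entry &gt;</a>-->"
new_date <- as.Date("2020-11-20") # Example date of the newest post
# Replace the commented-out placeholder with a live link to the newest post
footer <- gsub("<!--\\| <a href=\"\">Next entry &gt;</a>-->",
               sprintf("| <a href=\"%s.html\">Next entry &gt;</a>", format(new_date)),
               footer)
cat(footer, "\n")
```

If the pattern does not match, `gsub` silently returns the line unchanged, so it can be worth checking the output file by eye after the first run.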

And this is literally all there is to it. I might add some new parts to it as I am figuring things out (such as the category pages, which I really just added in the last week) but I think the core will stay the same. I don't think I need anything fancier.