A Practical Haskell Retrospective: Using Parsec, REST and Pandoc to Scrape Jira

Posted on May 12, 2014 by Handré Stolp

Introduction

This post is in preparation for a talk that I will be giving at my local functional programming users group. It is a retrospective on the first bit of practical Haskell code I wrote. It is a “tool” that we use internally to generate official specification documents from our issue tracker Jira.

The post gives introductions to some of the technologies that I used on my journey learning Haskell. My background going into this is 10 years of advanced C++ and some Lua, mostly in connection with game engine development. I don’t have a background in advanced mathematics, nor had I had exposure to any functional languages.

Background

I work for a company that makes training simulators and we use Jira as our issue tracking system. Some of our clients require official specification documents that need to be configured and signed off on. We wanted to keep Jira as the one place where we track everything and we also wanted the generation of the official specification documents to be as automated as possible. I could not find a Jira plug-in or tool that fit our needs exactly so I decided to write something myself.

The truth is I had recently read Learn You a Haskell for Great Good! and Real World Haskell, and I thought: here is an opportunity to actually learn Haskell by writing a tool. I wasn’t naive enough to think I could write such a tool from scratch, so I first scouted the landscape to see what was possible.

I knew Jira exposed a RESTful HTTP interface, not that I knew exactly what REST was, and found I could talk to it using the Haskell package http-conduit. I had consulted the oracle Google and discovered I could use the all powerful Pandoc to convert almost any format to any other; well sort of, but good enough for me. Pandoc did not have a parser for Jira markup but I remembered that Haskell has the mighty weapon Parsec which would make the job of writing my own parser a breeze.

What the tool had to do

We organized our issues in Jira hierarchically using the Structure plug-in. This issue hierarchy then defines the outline of our specification document. The headings in the document would be the issue ID along with the summary and the content would be the issue description.

So the tool had to talk to Jira using HTTP requests to get the issues and the hierarchy. It then had to parse the descriptions and generate a Pandoc AST. The Pandoc AST would then be used to generate an MS Word document. It essentially only had to glue together several libraries, with the only significant work being the Jira markup parser.
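
The mapping from issue hierarchy to document outline described above can be sketched with a small rose tree. This is only an illustration using base: the types Issue, IssueTree and Section, the function outline and the example data are all hypothetical names made up for this sketch, and they are not taken from the actual tool or from the Structure plug-in.

```haskell
-- A minimal model of an issue; all names in this sketch are hypothetical.
data Issue = Issue
    { issueKey         :: String
    , issueSummary     :: String
    , issueDescription :: String
    } deriving (Eq, Show)

-- An issue together with the issues nested below it.
data IssueTree = IssueTree Issue [IssueTree]
    deriving (Eq, Show)

-- One entry of the flattened document outline.
data Section = Section
    { sectionLevel   :: Int     -- heading depth, 1 for top-level issues
    , sectionHeading :: String  -- issue ID followed by the summary
    , sectionBody    :: String  -- the issue description
    } deriving (Eq, Show)

-- Walk the hierarchy depth first, so that every issue becomes a heading
-- that is followed by the sections of its children.
outline :: [IssueTree] -> [Section]
outline = concatMap (go 1)
  where
    go lvl (IssueTree i children) =
        Section lvl (issueKey i ++ " " ++ issueSummary i) (issueDescription i)
            : concatMap (go (lvl + 1)) children

-- Made-up example data, not real issues.
exampleHierarchy :: [IssueTree]
exampleHierarchy =
    [ IssueTree (Issue "DEMO-1" "Simulator" "Top-level requirements.")
        [ IssueTree (Issue "DEMO-2" "Controls" "Control requirements.") []
        , IssueTree (Issue "DEMO-3" "Visuals" "Visual requirements.") []
        ]
    ]
```

Each Section could then be turned into a heading of the given level followed by the parsed description.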

Talking to Jira

Jira’s RESTlike Interface

You can talk to Jira over HTTP using the Jira REST APIs. Whether the API is 100% RESTful, I am not actually clued up enough to say. REST stands for representational state transfer and is an architectural style described by Roy Fielding in 2000 in his doctoral thesis. REST has become an internet buzzword and a lot of HTTP-based interfaces call themselves RESTful even though they technically aren’t (REST APIs must be hypertext-driven). It seems that many HTTP-based web-service interfaces which aren’t SOAP-based are called RESTful, but how RESTful they are is up for debate.

You can have a look at this layman’s example on Stack Overflow, “What exactly is RESTful programming?”. My probably incorrect take on how RESTful the Jira REST APIs are is that they fall short by using the URI hierarchy as part of the protocol instead of the media type. I guess what can be done with a resource should be self-described by the data returned as part of a request and its media type.

So let’s just call these web-service interfaces RESTlike and give an incomplete, incorrect and simplified overview of what is involved (but look at the Wikipedia entry about REST constraints).

HTTP-Conduit

In order to pick the Jira server’s brain about all the little issues it has, we need to send HTTP requests and receive the responses. Luckily we do not have to do this manually: there is a very easy to use Haskell library called http-conduit. It has nice examples in its documentation, so I will show an example using it to get an issue from Jira.

import Network.HTTP.Conduit
import Network.HTTP.Types.Header
import Network.Connection (TLSSettings (..))
import Network.Socket(withSocketsDo)
import qualified Data.ByteString.Lazy.Char8 as B

fetchTestIssue :: IO ()
fetchTestIssue = do
        -- We use a demo instance of Jira available on the web
   let  _host =  "https://jira.atlassian.com"
        -- We use the latest version of REST API to select one of the issues
        -- and we limit the returned fields to be the summary and description only
        uri  = _host ++ "/rest/api/latest/issue/DEMO-3083?fields=summary,description"
        -- This is just to ignore the failure of the security certificate
        settings = mkManagerSettings (TLSSettingsSimple True False False) Nothing 
   -- We make the request by parsing the URL, the request method by default is get
   request  <- parseUrl uri
   -- We do the request and receive the response
   response <- withSocketsDo $ withManagerSettings settings $ httpLbs request
        -- We select some of the headers of the response
   let  hdr = filter g . responseHeaders $ response
        g (h, _) | h == hContentType = True
        g (h, _) | h == hServer      = True
        g _                          = False
        -- We get the response body
        bdy = responseBody response
   -- print the selected headers and response body
   putStrLn $ "Response headers = \n" ++ show hdr
   putStrLn "Response body = "
   B.putStrLn bdy 

And this is the response we get:

Response headers = 
[("Server","nginx"),("Content-Type","application/json;charset=UTF-8")]
Response body = 
{"expand":"renderedFields,names,schema,transitions,operations,editmeta,changelog",
"id":"333132",
"self":"https://jira.atlassian.com/rest/api/latest/issue/333132",
"key":"DEMO-3083",
"fields":{"summary":"Backspace to delete zero to enter your dosage ",
"description":"You have to delete zero first before you can put in your Dosage"}}

We see that the content type is application/json and that the response body has some extra information along with the fields that were requested. There is a convenient library for serializing and deserializing JSON encoded data called aeson.
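
Going back to the request for a moment: the URI in the example above was written out by hand. A small helper can assemble it from its parts instead. This is only a sketch using base; issueUri is a hypothetical name and is not part of http-conduit or of the Jira API, and it assumes the issue key and field names need no percent-encoding.

```haskell
import Data.List (intercalate)

-- Build the URI for fetching a single issue, limited to the given fields.
-- issueUri is a hypothetical helper for this sketch. It does no
-- percent-encoding, so it assumes plain keys and field names.
issueUri :: String -> String -> [String] -> String
issueUri host key fields =
    host ++ "/rest/api/latest/issue/" ++ key ++ fieldQuery
  where
    fieldQuery
        | null fields = ""   -- no restriction, use the server's default fields
        | otherwise   = "?fields=" ++ intercalate "," fields
```

Calling issueUri "https://jira.atlassian.com" "DEMO-3083" ["summary", "description"] gives back the URI used in the example.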

Aeson

Aeson was the father of Jason in Greek mythology and this library is the big daddy of JSON parsing. Aeson allows you to specify how to encode and decode Haskell types to and from JSON. As an added bonus the YAML package uses the exact same type classes to encode and decode to and from YAML.

In order for your type to be encoded as JSON it must be a member of the ToJSON type class and if you want to turn some JSON into your type then you need a FromJSON instance. With the DeriveGeneric GHC extension these instances can automatically be derived for your types. Of course the JSON you receive will not always match the structure of the types you want to use internally, and in this case you would manually define how to map from JSON to your type.

Even when you have to manually define the mapping from JSON to your type it is quite easy to do with very little overhead. It usually involves using only a few parser combinators that act on Aeson’s representation of JSON values. A useful trick to use when decoding a specific JSON response to your type, is to wrap your type in a newtype, and then define the FromJSON instance for the wrapper type. Below is an example decoding the response from Jira to an internal type and then encoding it to YAML before printing it out again.

{-# LANGUAGE OverloadedStrings, DeriveGeneric #-}
import           Network.HTTP.Conduit
import           Network.Connection (TLSSettings (..))
import           Network.Socket(withSocketsDo)
import           Control.Applicative
import qualified Data.Aeson as AS
import           Data.Aeson ((.:), (.:?), (.!=))
import qualified Data.Aeson.Types as AS (typeMismatch)
import qualified Data.Yaml as YAML
import qualified Data.ByteString.Char8 as B
import           GHC.Generics


-- The data type that will represent our issue
data Issue = Issue {issueId :: Int, issueKey, issueSummary, issueDescription :: String} 
             deriving (Eq, Show, Read, Generic)
-- Automatically derive instances for our issue type allowing us to encode/decode
-- to and from JSON and YAML. 
instance AS.ToJSON Issue 
instance AS.FromJSON Issue 

-- The newtype wrapper used to decode the JSON response body received
-- from the Jira server
newtype IssueResponse = IssueResponse {issueFromResponse :: Issue}

-- Manually define how to turn a JSON representation into an IssueResponse
instance AS.FromJSON IssueResponse where
    parseJSON (AS.Object v) = do                -- v is the parsed JSON object
        fields <- v .: "fields"                 -- select the fields member
        -- Lift the Issue constructor into the parsing monad and
        -- apply it to the results of looking up values in the JSON object
        Issue <$> (read <$> v .: "id")          -- select id member as an Int
              <*> v .: "key"                    -- select key member
              <*> fields .: "summary"           -- select summary from the fields
              <*> fields .:? "description"      -- optionally select description
                                                -- from the fields.
                         .!= "No description"   -- if it is not present then this
                                                -- will be the default value
        -- Wrap the result type in IssueResponse
        >>= pure . IssueResponse                
    -- Error message on parse failure
    parseJSON a = AS.typeMismatch "Expecting JSON object for Issue" a

fetchTestIssue :: IO ()
fetchTestIssue = do
        -- We use a demo instance of Jira available on the web
   let  _host =  "https://jira.atlassian.com"
        -- We use the latest version of REST API to select one of the issues
        -- and we limit the returned fields to be the summary and description only
        uri  = _host ++ "/rest/api/latest/issue/DEMO-3083?fields=summary,description"
        -- This is just to ignore the failure of the security certificate
        settings = mkManagerSettings (TLSSettingsSimple True False False) Nothing 
   -- We make the request by parsing the URL
   request  <- parseUrl uri
   -- do the request
   response <- withSocketsDo $ withManagerSettings settings $ httpLbs request
   -- Get the response body. 
   -- Decode it as IssueResponse type possibly failing
   -- If decoding was successful turn the result into an Issue type
   -- Encode the whole result (possibly failed) as YAML
   -- Print the resultant ByteString to the console 
   B.putStrLn . YAML.encode . fmap issueFromResponse . AS.decode . responseBody $ response

The result printed out would look like this:

issueDescription: You have to delete zero first before you can put in your Dosage
issueId: 333132
issueKey: DEMO-3083
issueSummary: ! 'Backspace to delete zero to enter your dosage '

Creating the document

Pandoc

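Pandoc is a Haskell library and command line utility that allows you to read several markup formats and write several markup/document formats. It can read formats such as Markdown, reStructuredText, HTML and LaTeX, and it can write formats such as Markdown, HTML, LaTeX and Microsoft Word docx, among others.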

All the readers parse to the same abstract representation and all the writers consume this abstract representation. So it is very modular, since all you have to do to support a new input format is add a reader and it can output as any of the writer formats, and similarly the other way around. The abstract representation of Pandoc is also ideal when you want to programmatically generate documents, which is exactly what we want to do.
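
To make the reader/AST/writer split concrete, here is a toy model of that architecture using only base. It is not Pandoc's API: the Block type, the two readers, the two writers and convert are all hypothetical names made up for this sketch, and Pandoc's real AST is far richer.

```haskell
import Data.Char (toUpper)
import Data.List (stripPrefix)

-- A toy document AST with just two kinds of block.
data Block
    = Heading String
    | Paragraph String
    deriving (Eq, Show)

type Reader = String -> [Block]
type Writer = [Block] -> String

-- Shared helper: every non-empty line becomes one block, and a line that
-- starts with the given marker becomes a heading.
readLinesWith :: String -> Reader
readLinesWith marker = map toBlock . filter (not . null) . lines
  where
    toBlock l = maybe (Paragraph l) Heading (stripPrefix marker l)

-- Reader for a Markdown-like input, where "# " starts a heading.
readMarkdownish :: Reader
readMarkdownish = readLinesWith "# "

-- Reader for a Jira-like input, where "h1. " starts a heading.
readJiraish :: Reader
readJiraish = readLinesWith "h1. "

-- Writer producing HTML-like output (no escaping, this is only a toy).
writeHtmlish :: Writer
writeHtmlish = concatMap render
  where
    render (Heading t)   = "<h1>" ++ t ++ "</h1>\n"
    render (Paragraph t) = "<p>" ++ t ++ "</p>\n"

-- Writer producing plain text with upper-case headings.
writePlain :: Writer
writePlain = unlines . map render
  where
    render (Heading t)   = map toUpper t
    render (Paragraph t) = t

-- Any reader composes with any writer, because both only know the AST.
convert :: Reader -> Writer -> String -> String
convert reader writer = writer . reader
```

Adding readJiraish required no change to either writer, which is the property that makes writing a single Jira markup reader enough to reach every output format.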

Here is an example of programmatically generating a Pandoc document and then writing it out as Pandoc markdown and HTML.

import Text.Pandoc
import Text.Pandoc.Builder hiding (space)
import Text.Blaze.Renderer.String
import qualified Data.Map as M


-- We use the helpers to construct an AST for a table with some text in it
aTable :: [Block]
aTable = toList $ -- convert the builder type to the AST type
            -- Create a 2 column table without a caption, aligned left
            table (str "") (replicate 2 (AlignLeft,0)) 
                -- The header row for the table
                [ para . str $ "Gordon", para . str $ "Ramsy"]
                -- The rows of the table
                [ [para . str $ "Sally", para . str $ "Storm"]
                , [para . str $ "Blah",  para . str $ "Bleh"]
                ]

-- Create our document along with its meta data
myDoc :: Pandoc
myDoc = Pandoc (Meta M.empty) aTable

main :: IO ()
main = do
    -- render as Pandoc Markdown
    putStrLn $ writeMarkdown def myDoc
    -- render as HTML
    putStrLn $ renderMarkup $ writeHtml def myDoc

The output is:

$ ../test/PandocEx.exe
  Gordon   Ramsy
  -------- -------
  Sally    Storm
  Blah     Bleh

  :


<table>
<caption></caption>
<thead>
<tr class="header">
<th align="left"><p>Gordon</p></th>
<th align="left"><p>Ramsy</p></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td align="left"><p>Sally</p></td>
<td align="left"><p>Storm</p></td>
</tr>
<tr class="even">
<td align="left"><p>Blah</p></td>
<td align="left"><p>Bleh</p></td>
</tr>
</tbody>
</table>

Parsing Jira markup with Parsec

The description of a Jira issue is formatted using Jira markup and we wanted the generated document to have the same formatting that you see in Jira. Unfortunately there is no reader that can convert from Jira markup to Pandoc’s AST. This meant that I had to write a parser for Jira markup. Writing parsers is well supported in Haskell and the library to use is usually Parsec or one of its variants.

From the Haskell wiki we get the following “Parsec is a monadic parser combinator library and it can parse context-sensitive, infinite look-ahead grammars but performs best on predictive (LL[1]) grammars.”

You build more complex parsers by combining smaller parsers using the provided parser combinators. Your final parser is run against some input and it produces some values. For more complex parsers you can pass in user state to be used while parsing. Below is an example of a simple search and replace parser.

import           Text.Parsec.Char
import           Text.Parsec.String (Parser)
import           Text.Parsec.Prim hiding ((<|>))
import           Text.Parsec.Combinator
import           Control.Applicative
import           Control.Monad
import qualified Data.Map as M
import           Data.Maybe

-- Parse an issue description and replace the issue references
replaceIssueRefs :: M.Map String String -> Parser String
                     -- multiple times parse either a link or normal text
                     -- and then concatenate it all into a single string
replaceIssueRefs m = concat <$> (many1 . choice $ [issue_ref, normal_text])
    where
        -- normal text is any character until we reach an issue reference or end of file
        normal_text = manyTill anyChar (lookAhead (void issue_ref <|> eof)) 
                      -- check that this parse does not accept empty text
                      >>= \txt' -> if null txt' then fail "" else return txt'
        -- match the string DEMO- followed by at least 1 digit
        -- lookup the matched string in the map replacing it
        -- if the parser fails consume no input
        issue_ref = try $ fromJust . (`M.lookup` m) <$> ((++) <$> string "DEMO-" <*> many1 digit)

main :: IO ()
main = do
        -- The map of issue references to replace
    let m = M.fromList [("DEMO-132", "OMED-457"), ("DEMO-987", "OMED-765")]
        -- The input issue description text
        s = "See issue DEMO-132 for more information related to the bug listed in DEMO-987"
    -- parse the issue description using the replaceIssueRefs parser
    case parse (replaceIssueRefs m) "" s of               
                        Left e -> print e       -- on failure print error
                        Right rs -> putStrLn rs -- on success print out:
    -- See issue OMED-457 for more information related to the bug listed in OMED-765
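
The parser above produces a String, whereas a markup parser has to produce a document structure. As a rough sketch of that idea, the following parses Jira's *strong* markup into a small inline type. It uses ReadP from base rather than Parsec so that it runs without extra packages, and the Inline type and all parser names are hypothetical; it is not the parser from the actual tool.

```haskell
import Text.ParserCombinators.ReadP

-- A tiny inline AST, standing in for Pandoc's much richer one.
data Inline
    = Plain String
    | Strong String
    deriving (Eq, Show)

-- Text between two asterisks is strong text.
strong :: ReadP Inline
strong = Strong <$> between (char '*') (char '*') (munch1 (/= '*'))

-- Plain text is the longest run of characters without an asterisk.
plain :: ReadP Inline
plain = Plain <$> munch1 (/= '*')

-- An asterisk that does not open a strong span is kept as plain text.
loneStar :: ReadP Inline
loneStar = Plain <$> string "*"

-- Try the alternatives from left to right until the input is used up.
inlines :: ReadP [Inline]
inlines = many (strong <++ plain <++ loneStar) <* eof

-- Run the parser, succeeding only on a single complete parse.
parseInlines :: String -> Maybe [Inline]
parseInlines s =
    case readP_to_S inlines s of
        [(result, "")] -> Just result
        _              -> Nothing
```

The shape is the same as in the Parsec example: small parsers for each construct, an ordered choice between them, and repetition over the whole input.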

My Experience

Initially avoiding having to think

In my experience the fact that a lot of Haskell code is declarative allows you to get quite far just gluing things together without necessarily having to solve anything in a functional paradigm. You are helped along in this by the type system, which acts like safety rails guiding you to compose things together correctly.

Of course eventually you have to solve something, you actually have to think, and because of my imperative background this was a hurdle in the beginning. I remember having to do something simple but being stuck, not knowing how to go forward. I had to adjust my perspective a bit, but this is normal for anything new that one learns. I did miss having access to a printf or a debugger while I was trying to readjust my world view. I did discover Debug.Trace, but because of Haskell’s laziness this didn’t always help that much, and I found the runtime errors wanting.

Error reporting

I like the idea of let it break: make your assumptions; assert on them; run the whole tooty; and see where you were mistaken. This is actually a problem in Haskell: because you do not have a stack trace, your runtime error is not very helpful. I can appreciate why it is a problem in Haskell, because your semantic stack is not usually the same as your execution stack. You have a lot of higher-order programming going on, along with laziness and optimizations that will reorder your code. That said, there are some things you can do, like recompiling with profiling on, but I imagine this won’t really help you in production. Simon Marlow gave a good talk about this that you can catch on YouTube: “HIW 2012. Simon Marlow: Why can’t I get a stack trace”.

In the end I guess in Haskell you should explicitly code for your failure conditions and generally just think about what you are doing, which is probably not a bad thing.
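
As a small illustration of coding explicitly for failure, here is a sketch that splits an issue key into its project and number and reports what went wrong through Either instead of crashing at runtime (compare the fromJust in the parser example earlier). KeyError and parseIssueKey are hypothetical names for this sketch and are not taken from the tool.

```haskell
import Data.Char (isDigit)

-- The ways in which an issue key can be malformed (hypothetical type).
data KeyError
    = MissingDash String     -- no '-' separator in the key
    | MissingProject String  -- nothing before the '-'
    | BadNumber String       -- the part after the '-' is not a number
    deriving (Eq, Show)

-- Split a key such as "DEMO-3083" into its project and issue number.
-- Every failure condition is a value the caller has to deal with.
parseIssueKey :: String -> Either KeyError (String, Int)
parseIssueKey key =
    case break (== '-') key of
        (project, '-' : digits)
            | null project                            -> Left (MissingProject key)
            | null digits || not (all isDigit digits) -> Left (BadNumber digits)
            | otherwise                               -> Right (project, read digits)
        _ -> Left (MissingDash key)
```

The call to read is safe here only because the guard before it has already checked that the string consists of digits.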

Printf with Debug.Trace

Debug.Trace is a printf escape hatch for your pure code in Haskell. So for someone like me, that sounded like the best thing since sliced bread. The problem is that trace only emits when the expression that involves it is evaluated, and since Haskell is lazy it often never emits. It was a pain trying to debug my parser, which I should probably have written better and tested with QuickCheck.
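
A minimal sketch of the effect, using only Debug.Trace from base; the names tracedPair, silent and noisy are made up for this illustration.

```haskell
import Debug.Trace (trace)

-- The trace message is attached to the first component only.
tracedPair :: (Int, Int)
tracedPair = (trace "first component evaluated" 1, 2)

-- Only looks at the second component, so the traced expression is never
-- evaluated and no message is emitted.
silent :: Int
silent = snd tracedPair

-- Evaluates the first component, so the message is written to stderr.
noisy :: Int
noisy = fst tracedPair
```

Evaluating silent emits nothing, while evaluating noisy emits the message when the first component is first evaluated.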

I ended up using a very ugly, dirty hack to force the printing of my debug strings. I forced the evaluation of my trace by stringing my parser state through it. Even though this was a dirty hack it did actually help me to understand all my misconceptions.

-- ......
type MyParser = Parsec String ParseState
-- .....
-- Really gross but worked.
-- Force trace to emit by requiring subsequent parser actions
-- to access the parse state through my trace message
traceM' :: String -> MyParser ()
traceM' msg = getState >>= (\s -> return $! trace ('\n' : msg) s) >>= setState 

In the end I actually really like Haskell

Minor gripes aside, my feelings about Haskell are very positive. Some people are scared off by the operators and strange-sounding type classes, but I found Haskell to be very consistent. There are only a few idioms and operators to learn and they are used all over the place, in the same way, and you can expect the same behaviour. The libraries seem to be very composable and seem to converge on convention and style. Combine the succinct code with a type system that gives you confidence and in my book you have a winner.

The source

If anyone is interested, the source code for the tool can be found here https://github.com/HanStolpo/JiraStructureToDocx but be warned: the quality is very poor. It is just good enough as an internal dev tool for us, and it was a learning experience.

The slides for the talk

The slides for the talk can be found here Slides