
Article’s publication date extractor – an overview

Posted on December 13, 2015 by

A few days ago I released an open source Python module that gives you a simple way to extract and normalize the publication date of any online blog or news post. There are some commercial solutions out there, but why not just use this module for free?

 

The logic behind the code

Here at Webhose.io we use multiple methods to automatically detect and extract the date from articles, blog posts and comments. A publication date can appear in many forms: a numerical format (e.g. 01/02/2015), a textual format (e.g. Yesterday), or a combination of both (e.g. Jan 1st, 2015). On top of that, separators vary, and a date such as 01/02/2015 can be read as January 2nd or February 1st, depending on whether it follows the American or the European convention.

Fortunately there are standards out there. Unfortunately, there are A LOT of standards! The date extraction function tries multiple methods to accurately extract and normalize the date.

Try the URL
More often than not the date appears in the URL of the post. Since the URL doesn’t include the time, we only use it as a fallback in case the other methods fail. We use a regular expression that looks for a year followed by a month and a day, and accepts several separators (2015/1/1, 2015-1-1, 2015.1.1, 2015_1_1).

Here is the regular expression we use:
([\./\-_]{0,1}(19|20)\d{2})[\./\-_]{0,1}(([0-3]{0,1}[0-9][\./\-_])|(\w{3,5}[\./\-_]))([0-3]{0,1}[0-9][\./\-]{0,1})?
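As a rough illustration of how this pattern could be applied, here is a minimal sketch using only Python's standard re module. The function name extract_date_from_url and the example URLs are hypothetical and not taken from the module itself. The sketch only returns the matched fragment; turning it into a real date is left to a later step.

```python
import re

# The pattern quoted above, split across lines but otherwise unchanged.
URL_DATE_PATTERN = re.compile(
    r"([\./\-_]{0,1}(19|20)\d{2})[\./\-_]{0,1}"
    r"(([0-3]{0,1}[0-9][\./\-_])|(\w{3,5}[\./\-_]))"
    r"([0-3]{0,1}[0-9][\./\-]{0,1})?"
)


def extract_date_from_url(url):
    """Return the date-like fragment of a URL, or None if nothing matches."""
    match = URL_DATE_PATTERN.search(url)
    if match is None:
        return None
    # Trim the separators the pattern picks up on either side of the date.
    return match.group(0).strip("./-_")


print(extract_date_from_url("http://example.com/2015/11/26/some-post"))
print(extract_date_from_url("http://example.com/about"))
```

Note that the pattern expects the year first, so a URL that puts the day or month before the year would not match.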


Try JSON-LD
JSON-LD is an easy-to-use, JSON-based linked data format that uses the concept of a context to specify the vocabulary for types and properties. Some documents specify the creation or publication date using this method, so it’s always worth a try!

JSON-LD markup example:

{
  "@context": "http://www.w3.org/ns/activitystreams",
  "@type": "Create",
  "actor": {
    "@type": "Person",
    "@id": "acct:sally@example.org",
    "displayName": "Sally"
  },
  "object": {
    "@type": "Note",
    "content": "This is a simple note"
  },
  "published": "2015-01-25T12:34:56Z"
}
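Reading the date out of markup like this could look roughly as follows. This is a sketch under assumptions, not the module's actual code: the function name is hypothetical, and the list of keys is a guess that combines the "published" key from the example with the schema.org property names datePublished and dateCreated.

```python
import json

# Assumed candidate keys: "published" appears in the example above;
# "datePublished" and "dateCreated" are schema.org property names.
DATE_KEYS = ("published", "datePublished", "dateCreated")


def extract_date_from_json_ld(raw_json):
    """Return the first date-like string found at the top level, or None."""
    try:
        data = json.loads(raw_json)
    except ValueError:
        return None  # malformed JSON-LD happens, so fail softly
    if not isinstance(data, dict):
        return None
    for key in DATE_KEYS:
        value = data.get(key)
        if isinstance(value, str):
            return value
    return None


sample = '{"@type": "Create", "published": "2015-01-25T12:34:56Z"}'
print(extract_date_from_json_ld(sample))
```

The returned value is still a plain string at this point and needs to be normalized separately.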

 

META to the rescue?
If JSON-LD fails (it usually does), we look for the date in the document’s meta tags. There are many types of meta tags (a lot of standards, remember?), so we go over all of the different formats.

Some META tag examples:
<meta name="article.published" content="2015-11-26T11:53:00.000Z" />
<meta property="bt:pubDate" content="2015-11-26T00:10:33+00:00">
<meta name="DC.date.issued" content="2015-11-26">
<meta name="pubdate" content="2015-11-26T07:11:02Z">
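To make the idea concrete, here is a small sketch that scans meta tags for the names shown in the examples. It uses the standard library's html.parser so that it runs without extra dependencies. The class name, the function name, and the short list of tag names are assumptions for illustration, and a real list would be much longer.

```python
from html.parser import HTMLParser

# Lowercased names taken from the examples above (an assumed, partial list).
META_DATE_NAMES = {"article.published", "bt:pubdate", "dc.date.issued", "pubdate"}


class MetaDateFinder(HTMLParser):
    """Remember the content of the first <meta> tag that looks like a date tag."""

    def __init__(self):
        super().__init__()
        self.date_text = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.date_text is not None:
            return
        attributes = dict(attrs)
        # Some sites use name=..., others use property=...
        key = attributes.get("name") or attributes.get("property") or ""
        if key.lower() in META_DATE_NAMES:
            self.date_text = attributes.get("content")


def extract_date_from_meta(html):
    finder = MetaDateFinder()
    finder.feed(html)
    finder.close()
    return finder.date_text


page = '<head><meta name="DC.date.issued" content="2015-11-26"></head>'
print(extract_date_from_meta(page))
```

Comparing names in lowercase keeps the lookup tolerant of variations such as pubDate and pubdate.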

 

Last resort – the HTML
At the risk of losing accuracy, if all else fails we look into the HTML itself. A mix of standards and popular date annotations is evaluated in order to find the elusive date:

<span class="timestamp">01/02/2015</span>
<div class="pubdate">Jan 1st, 2015</div>
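A very simplified version of this last-resort step might look like the sketch below. It assumes that a class name containing "pubdate" or "timestamp" marks a date element, which is based only on the two examples above, and it uses the standard library's html.parser rather than whatever the module really does. All names are hypothetical.

```python
import re
from html.parser import HTMLParser

# Assumed class-name hints, based on the two examples above.
DATE_CLASS_PATTERN = re.compile(r"pubdate|timestamp", re.IGNORECASE)


class ClassDateFinder(HTMLParser):
    """Grab the first non-empty text that follows a tag with a date-like class."""

    def __init__(self):
        super().__init__()
        self.capturing = False
        self.date_text = None

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class") or ""
        if self.date_text is None and DATE_CLASS_PATTERN.search(css_class):
            self.capturing = True

    def handle_data(self, data):
        if self.capturing and data.strip():
            self.date_text = data.strip()
            self.capturing = False


def extract_date_from_html(html):
    finder = ClassDateFinder()
    finder.feed(html)
    finder.close()
    return finder.date_text


print(extract_date_from_html('<p>Hello</p><div class="pubdate">Jan 1st, 2015</div>'))
```

Because any element with a matching class is accepted, this is also where false positives are most likely to creep in.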

 

Unifying the date

Once we find the textual date, we unify it using the excellent python-dateutil module. It’s an amazing solution that converts a textual date into a datetime object.
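For readers who haven't used python-dateutil, the short sketch below shows the kind of normalization it provides. The wrapper name normalize_date and its day_first parameter are hypothetical; parser.parse and its dayfirst argument are part of python-dateutil. The module must be installed separately (for example with pip install python-dateutil).

```python
from dateutil import parser


def normalize_date(text, day_first=False):
    """Parse a textual date into a datetime object, or return None on failure."""
    try:
        # dayfirst=True reads 01/02/2015 the European way (1 February).
        return parser.parse(text, dayfirst=day_first)
    except (ValueError, OverflowError):
        return None


print(normalize_date("2015-11-26T07:11:02Z"))
print(normalize_date("Jan 1st, 2015"))
print(normalize_date("01/02/2015"))                  # month first by default
print(normalize_date("01/02/2015", day_first=True))  # day first
```

The default month-first reading matches the American convention, so a caller who knows the site is European would have to pass the flag explicitly.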

 

Parsing the document

To parse the HTML document we use Beautiful Soup. It has powerful parsing capabilities and is very simple to use. For the JSON-LD part, we use Python’s built-in json module to load and parse the JSON.

 

Precision and Recall

We tested the “Article Date Extractor” module against Google’s news feed, and got close to 100% precision with almost 90% recall. You can of course increase recall by adding more patterns to the HTML extraction function, but you risk lowering the precision score.

 

Contribute

That’s it. Feel free to share it, use it, and contribute if you feel you can make this module better.

 


This entry was posted in Big Data, Technology.
