When is it okay to grab data from someone else’s website without their explicit permission? A new ruling by a federal judge in California might have dramatic implications for this question, and for the open nature of the web in general.
As reported in several outlets (including The Verge, Engadget, and The Register), the ruling was made as part of a lawsuit involving LinkedIn and a startup called hiQ Labs. HiQ has developed software that uses advanced algorithms to detect when a LinkedIn user is looking for a new job, based on changes in the information available on his or her public profile.
These algorithms run on data that hiQ scraped from web pages on LinkedIn, much to the chagrin of the latter, which applied various technical measures, as well as some sternly worded legal warnings, to stop hiQ’s bots from continuing their scraping efforts.
This series of events led hiQ to turn to the court, demanding that LinkedIn remove the technical barriers it had established to prevent hiQ’s use of the data, and that the court declare hiQ’s scraping actions legal.
Last Monday, a federal judge ruled in favor of hiQ, prohibiting LinkedIn from “preventing hiQ’s access, copying, or use of public profiles on LinkedIn’s website (i.e., information that LinkedIn members have designated public).” LinkedIn was further ordered to remove any technical or legal mechanisms put in place to prevent such access, and to refrain from implementing these means in the future.
To use less convoluted terms, LinkedIn was told to allow hiQ to scrape any data from LinkedIn profiles that could be accessed without logging in to the service.
Taking a Stand in Favor of Open Web Data
It’s important to note that this ruling was made as a preliminary injunction – a temporary court order meant to protect the rights of the parties during the legal deliberations. In other words, it’s not the final decision, which might eventually be in favor of LinkedIn.
However, it’s still interesting to see how District Judge Edward Chen came to his conclusion. LinkedIn raised various arguments against hiQ’s crawlers, including the claim that hiQ was breaching the privacy of LinkedIn’s users, which the judge shot down based on the fact that LinkedIn itself is apparently selling this data. We won’t go into all the ins and outs of the decision, and you can read the full text here (PDF) if you’re interested, but it is worth noting Judge Chen’s opinion regarding the legality of web scraping.
LinkedIn claimed that hiQ was violating the Computer Fraud and Abuse Act (CFAA) by unlawfully accessing the servers where its data was hosted and obtaining information without authorization. The court dissected and eventually rejected this argument, essentially ruling that hiQ was allowed to scrape any information that appeared on a public page not protected by a password.
To reach this conclusion, the court adopted a narrow interpretation of the law, equating a public website to a store whose door is open. Someone who walks through that door during business hours is not trespassing, and in that sense hiQ’s bots were not trespassing on LinkedIn’s property. In this context the court says:
A user does not “access” a computer “without authorization” by using bots, even in the face of technical countermeasures, when the data it accesses is otherwise open to the public.
Judge Chen goes on to explain why it is in the public interest not to treat every web scraper as a criminal or hacker:
In view of the vast amount of information publicly available, the value and utility of much of that information is derived from the ability to find, aggregate, organize, and analyze data.
…Conferring on private entities such as LinkedIn, the blanket authority to block viewers from accessing information publicly available on its website for any reason, backed by sanctions of the CFAA, could pose an ominous threat to public discourse and the free flow of information promised by the Internet.
The bottom line is this: in addition to other legal arguments which we haven’t covered in depth here, the court acknowledges that it is in the public interest to allow web data to be crawled, gathered and analyzed – and that placing obstacles and barriers to prevent this activity would be a detrimental policy decision.
Our Take on the Matter
Because Webhose.io is dedicated to providing a 100% white-hat solution to web data harvesting, we actually don’t extract LinkedIn data. We prefer to play it completely safe and stay away from any website whose robots.txt file marks us as unwanted, as well as from any information that’s behind a login screen.
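For readers curious what respecting robots.txt can look like in practice, here is a minimal sketch using Python's standard-library urllib.robotparser. It is not our production code: the sample robots.txt content, the bot names, the example.com URLs, and the is_allowed helper are all hypothetical and serve only to illustrate the idea of checking a site's stated preferences before crawling it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, made up for this illustration.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: ExampleBot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    checks = [
        ("ExampleBot", "https://example.com/blog/post"),
        ("OtherBot", "https://example.com/blog/post"),
        ("OtherBot", "https://example.com/private/page"),
    ]
    for agent, url in checks:
        verdict = "allowed" if is_allowed(SAMPLE_ROBOTS_TXT, agent, url) else "blocked"
        print(f"{agent} -> {url}: {verdict}")
```

In a real crawler the robots.txt text would be fetched from the target site rather than hard-coded, and a check like this says nothing about content behind a login screen, which has to be excluded separately.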
So while this decision isn’t directly relevant for our own web data business, it is definitely great to see a federal court acknowledge the strong public interest in keeping data on the open web public, accessible, and available for those who can use it to create innovative products and research.
We believe web data can be an awesome resource for data scientists, developers, and media analysts – and they should continue to enjoy access to it, without overly restrictive and cumbersome processes standing in their way. In that sense, the hiQ v. LinkedIn ruling is great news – and we hope to see more decisions like it in the future.
This post represents the author’s personal opinion and should by no means be taken as legal advice.