LocalBlox, a company that scrapes data from public web profiles, left the details of more than 48 million users in a publicly accessible Amazon Web Services (AWS) S3 bucket, according to an UpGuard security researcher who discovered the data on February 28 this year.
UpGuard further stated that the company secured the bucket on the same day, after the researcher contacted the firm.
"The bucket contained one 151.3 GB compressed file, which, when decompressed, revealed a 1.2 TB ndjson (newline-delineated json) file," UpGuard said yesterday in a report summarizing its findings.
Based on the exposed file's name, final_people_data_2017_5_26_48m.json, this appears to be a backup of the LocalBlox database made on May 26, 2017.
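To illustrate how a date and a record count can be read off a file name like this one, here is a minimal Python sketch. The naming pattern (a label, then year, month, day, then a count in millions) is an assumption inferred from this single file name, and the function `parse_backup_name` is hypothetical, not something LocalBlox or UpGuard published.

```python
import re
from datetime import date

# Assumed pattern: <label>_<year>_<month>_<day>_<count>m.json
# This is inferred from one file name only and may not hold for other files.
NAME_PATTERN = re.compile(
    r"(?P<label>[a-z_]+?)_(?P<year>\d{4})_(?P<month>\d{1,2})_(?P<day>\d{1,2})"
    r"_(?P<millions>\d+)m\.json"
)

def parse_backup_name(filename):
    """Return the label, date, and approximate record count, or None if no match."""
    match = NAME_PATTERN.fullmatch(filename)
    if match is None:
        return None
    return {
        "label": match["label"],
        "date": date(int(match["year"]), int(match["month"]), int(match["day"])),
        "approx_records": int(match["millions"]) * 1_000_000,
    }

info = parse_backup_name("final_people_data_2017_5_26_48m.json")
print(info)
```

Under that assumption, the name yields May 26, 2017 and a count of roughly 48 million, which is consistent with the number of users reported in the file.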
LocalBlox claims on its website that it is capable of offering a "true 360 degree people view" by "marry[ing] work-life and personal-life individual data to generate combined intelligence."
UpGuard, which spent the past few weeks analyzing the data, says the LocalBlox archive it found contained data scraped from public profiles on sites like Facebook, LinkedIn, Twitter, and real estate site Zillow.
The JSON-formatted file contained names, physical addresses, dates of birth, (LinkedIn) job history, Twitter handles, and in some cases IP and email addresses.
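For readers unfamiliar with the ndjson format, the following Python sketch shows how such a file can be processed one record at a time without loading it all into memory. The sample records and their key names (`name`, `twitter_handle`, `email`) are made up for illustration and are not the actual LocalBlox schema.

```python
import io
import json

def iter_ndjson(stream):
    """Yield one parsed record per non-empty line of an NDJSON text stream."""
    for line in stream:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        yield json.loads(line)

# Two made-up records; the key names are hypothetical, not LocalBlox's schema.
sample = io.StringIO(
    '{"name": "Jane Doe", "twitter_handle": "@janedoe", "email": "jane@example.com"}\n'
    '{"name": "John Roe", "twitter_handle": "@johnroe"}\n'
)

records = list(iter_ndjson(sample))
# Fields vary per record, so check for a key before using it.
with_email = [r["name"] for r in records if "email" in r]
print(len(records), with_email)
```

Because each line is a complete JSON document, the same generator works on an open file object of any size. Only one record is held in memory at a time, provided the caller iterates over the generator instead of collecting everything into a list as this small demo does.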
Facebook profile data was also included. Based on the format of the data, UpGuard suggests it might have been collected using the social network's search feature that let users find profiles by email address, a feature that Facebook recently discontinued in light of the Cambridge Analytica scandal.
LocalBlox appears to have used this feature to identify user profiles and then collected the details available in users' public profiles. Collected details varied and could include names, pictures, skills, current job, companies (employers), family details, and other information.
In one sense this incident is a data leak, and in another it is not. LocalBlox suffered a leak by leaving the file on a misconfigured AWS server, but the exposed data was already publicly available information.
All the data appears to have been collected by scraping the respective sites' HTML code, rather than by using APIs, which are locked down under strict legal terms that prevent mass scraping.
Facebook, Twitter, and LinkedIn also include language in their public sites' terms of service that prohibits the scraping of public pages. But in recent years, US courts have sided with data scraping firms in lawsuits filed by social networks, suggesting that data published in public profiles does not fall under copyright or privacy protection laws.
Following the intense media coverage of the Cambridge Analytica scandal and the consequences for third-party firms that collect data on social network users without authorization, LocalBlox did not appear to take the publication of the UpGuard report lightly.
In a phone call with a ZDNet reporter, chief technical officer Ashfaq Rahman claimed that UpGuard "hacked" into its S3 bucket, that most of the data was "fabricated" and used for internal testing only, and that nobody but the UpGuard researcher accessed it.
Image credits: LocalBlox website