Ancestry.com scrapes websites; places harvested content behind membership wall
Bloggers covering this story: The Genealogue, Kimberly Powell at About.com, Kinexxions, Cow Hampshire, Genea-Musings, Untangled Family Roots, Ancestories, Jessica’s Gene Journal, Creative Gene, and Craig Manson.
Janice at Cow Hampshire has a good discussion about commercial use and examines Ancestry.com’s text describing their subscription-based service.
Becky Wiseman at Kinexxions provides screenshots and a walk-through of how the ripped-off data is presented:
The “Internet Biographical Collection” jumped out at me. Notice the padlock? I clicked on that link, but this is a “for pay” subscription database, and since I wasn’t logged in I couldn’t see the detail any more than the listing of pages, all of which, except for the last one, are from my website and they are definitely NOT part of Ancestry.com!!!
(I’m reading that Ancestry is no longer placing its scraped Internet Biographical Collection behind its paywall. But they still harvest sites, and they collect your contact information before allowing you to see the collection. Even to do due diligence and check whether my site has been scraped (I doubt it; I don’t have strict ancestral information here), I must create an account and log in.)
I have some technical questions about the bots they’re using, and how a technical defense can be put up. Visiting Netcraft and entering ancestry.com turns up some 44 subdomains, including such server names as search.ancestry.com, content.ancestry.com, boards.ancestry.com, trees.ancestry.com, landing.ancestry.com, id.ancestry.com, awt.ancestry.com, adserver.ancestry.com, awtc.ancestry.com, c.ancestry.com, and so forth. Which server is the one doing the scraping? And what form does the bot take? I’d like to know.
Okay, there’s the bad PR counterattack (already in progress), and the technical defense countermeasure. But after visiting the site’s home page, I decided that parody is the best revenge. A bit of screenshotting and Photoshoppery was called for.
Here’s my spoof image, full-size. Snag it and post it on your own site, if you’d like. (and link here, ahem!)
(If you’re visiting here because you’re a lawyer punching Ancestry.com’s billable hour clock, bear in mind what parody is, and skedaddle.)
UPDATE UPDATE UPDATE: Found the culprit.
It’s a MyFamilyBot (not to be confused with a Roomba). (Hat tip to commenter AnnieGMS in the comments at Genea-Musings.)
Here is the Ancestry.com page describing the bot. It follows the Robots.txt protocols. The challenge is how to construct a text file called robots.txt, placed in the root of your site, so that it excludes the MyFamilyBot and allows free rein to other bots (such as, say, Google’s, Yahoo’s, etc.).
How do I prevent MyFamilyBot from crawling my site?
MyFamilyBot supports the internet standard protocols for restricting spiders from crawling web sites. These protocols are described here:
Here are a couple of user agent records for the MyFamilyBot:
Mozilla/4.0 (compatible; MyFamilyBot/1.0; http://www.myfamilyinc.com) [source]
Mozilla/4.0 (compatible; MyFamilyBot/1.0; http://www.ancestry.com/learn/bot.aspx; SearchBot@MyFamilyInc.com) [source]
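If you can get at your server logs, one way to check whether the bot has come by is to look for its name in the user agent field. Here is a minimal sketch in Python, assuming a case-insensitive match on the name "MyFamilyBot" is enough. The function name is_myfamilybot is made up for illustration, and the Googlebot string is only there as a contrast case.

```python
import re

# The two user agent records quoted above.
USER_AGENTS = [
    "Mozilla/4.0 (compatible; MyFamilyBot/1.0; http://www.myfamilyinc.com)",
    "Mozilla/4.0 (compatible; MyFamilyBot/1.0; "
    "http://www.ancestry.com/learn/bot.aspx; SearchBot@MyFamilyInc.com)",
]

# Match on the bot name only, so a later version number would still be caught.
BOT_PATTERN = re.compile(r"MyFamilyBot", re.IGNORECASE)


def is_myfamilybot(user_agent):
    """Return True if the user agent string mentions MyFamilyBot (hypothetical helper)."""
    return BOT_PATTERN.search(user_agent) is not None


if __name__ == "__main__":
    for ua in USER_AGENTS:
        print(is_myfamilybot(ua), ua)
    # A different crawler, for comparison.
    other = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
    print(is_myfamilybot(other), other)
```

Matching on the name alone, without the "/1.0", is a deliberate choice here: it keeps working if the version number changes.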
The exclusion admin page (how web administrators can exclude robots) says this is the form to use to exclude one type of bot:
I guess it’d look like this [note: slight change here, see comments]:
That’s all fine and well and good if you can place a robots.txt file on your web server. (Well, there may be fine tunings to do, too. Robots.txt is not my specialty. I may post over on my 2020 Hindsight site about this. I get smart geeky readers there.)
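One way to sanity-check a robots.txt before uploading it is to run it through a parser. Below is a minimal sketch in Python using the standard library's urllib.robotparser. The file contents are an assumption, not a tested recipe: one record naming the bot without a version number, plus a catch-all record that leaves every other bot alone. The example.com address is a placeholder. Bear in mind this shows how Python's parser reads the rules, not how MyFamilyBot itself behaves.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt contents: shut out MyFamilyBot, allow everyone else.
ROBOTS_TXT = """\
User-agent: MyFamilyBot
Disallow: /

User-agent: *
Disallow:
"""


def build_parser(robots_txt):
    """Parse robots.txt text held in a string (no network access needed)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
page = "http://example.com/family/smith.html"  # placeholder URL

print(parser.can_fetch("MyFamilyBot", page))  # False: the bot is excluded
print(parser.can_fetch("Googlebot", page))    # True: other bots are allowed
```

An empty "Disallow:" line means nothing is off limits for that record, which is why the catch-all leaves other crawlers free to visit.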
What if you can’t get to the root level of your site? There’s this webmaster how-to on doing it from within the web page itself. But it looks like all or nothing (deny or allow all robots).
And if you’re on a site hosted by blogspot (many of the trespassed sites are), then I don’t know how to help, because I haven’t mucked around with the back end there.
p.s. Thanks for coming by and reading! I, er, put a lot of effort into a previous post for the day—how I’m processing a bunch of family letters. Please check it out!
UPDATE UPDATE UPDATE—Ancestry.com has pulled the Internet Biographical Collection
Ancestry responds to the controversy, takes its controversial collection offline. Or at least dismantles entry points. I’ll have another post about this.
Great job, Susan! I’ve posted a link to this on my post “Ancestry.com: Copyright Violations?”
Oh good, Miriam. I’ll add your post to the list of who’s flogging and blogging the issue.
Take the image. Take it. Use it if you’d like. But hey, I’ll take link love, too. I always like link love. :D
Boy, that’s interesting. I wonder if this collection is available through the Ancestry Library Edition. Because I don’t particularly want to give them my information just to see whether or not they’re ripping off my site, but I live a block from a library with free access to the Library Edition. I’ll have to try that in the next few days.
Ralph, I’m hitting on a solution in robots.txt. I’m composing an update to this post with more.
Have you done much in the way of robots.txt work and can you double-check my update when I do? (was gonna take the geekdetails to 2020 b/c readership has more geekitude there) Oh, and how in the world this is going to work for sites hosted on blogspot.com I have no idea.
Hey Maybe all you great bloggers should
Bill Ancestry for A Corp Subscription to your Blogs
And then charge interest for past due and late fees
Oh and by the way neglect to cancel their subscription.
Fair is Fair ..
Thank you so much for not only the wonderful article, but for the priceless gift of laughter (i.e. your spoof on Ancestry.com).
I’ve added a link from my article to yours as a “Must Read.”
Boy, aren’t all of our tongues wagging on this one. Great post, Susan. Lots of information. Someone had mentioned how to block them in another group I am in, but didn’t give instructions to do so. So thanks for the instructions.
YOU GO GIRL! While I’m ranting and raving, you’re doing something useful. Thanks for the robots reminder - I’m updating mine this morning.
Susan, your robots.txt looks good; I would probably delete the “/1.0” from the specification, because it would stop working when they update the bot to version 1.1.
It looks like Blogspot-hosted blogs have a Hobson’s choice; they can exclude all robots by including a META tag in the header of their templates, or they can allow all robots, including the MyFamilyBot, by leaving the META tag out. The notes from the meeting that defined the META tag robot exclusion protocol note that the attendees decided “not to add syntax to allow robot specific permissions within the meta-tag”. Short-sighted decision, but who knew in 1996….
Here is a rant I posted on Ancestry’s own blog on this topic. I’m not sure if they will leave it up.
Obviously, what people put up on the web for free is there to be shared, yet it is still under copyright. The idea that if people don’t want this to happen they should start their own paid sites is neither very practical nor the way the real world works. If it was, there would be no public libraries where books could be checked out free of charge. If I was free to scan the latest Harry Potter and make it seem to be mine just because it was freely available in a library, the world would be a much sadder place.
The differences between this and what Google does are many and are all of a type that put Ancestry on the wrong side of the ethical divide. Google does not claim the content for themselves or attempt to mislead people into thinking that it is Google’s by slapping the Google brand on it, did not charge people to see other people’s content before being flogged into submission and does not use other people’s data and creative output to harvest email addresses.
In some cases free content is put up to attract visitors to a site, either so that visitors see the work as a whole in its full context as the copyright owner sees fit, or so that the owner of the material presented can receive some sort of compensation for their hard work. That might be anything from money from AdSense advertising to asking people to register for a newsletter or presenting them with services they can purchase or even asking for a donation to help pay the hosting bill. On the internet, traffic and contact information are part of the game. No one is putting out free content with a motivation to drive traffic to Ancestry; that is what Ancestry ads on genealogy sites are for.
Yes, Google caches, but it presents links to live content first and in an eye-catching manner; the link to anything cached is small and comes later. Google, therefore, drives traffic TO people’s web sites far, far more than it keeps traffic FROM people’s sites. Many people’s free content is free because it is a labor of love and/or a work in progress. A person’s labor of love should not be scraped (plagiarized, suffer copyright infringement), claimed by someone else and covered with someone else’s branding. If a private person engages in this behavior they will find themselves on many a blacklist and their account blocked by their web hosting company.
Additionally, cached pages do not keep up with progress as well as the real thing and so are inferior unless the real thing disappears entirely. Then again, it may disappear entirely for a reason. I would not hesitate to remove something that I later discovered to be in error, and I for one want that control over my creation. It’s bad enough that I might have misled people who saw it; it is much worse if I can no longer remove the source of misinformation and people will go on being misled into the distant future.
If Ancestry wanted to add value without stealing, it could:
1) present links to live sites in preference to its own cached pages or, better still, only present cached pages if the site is no longer available
2a) clearly indicate that it is acting as a search engine service provider, not an information OWNER
2b) not being the owner, give not a full summary of the data but just enough info for the user to know if they wish to continue to the actual site, i.e. do it like Google does it
3) clearly supply information on how a person or institution can have their content removed from Ancestry’s cache if they so desire
4) clearly indicate how a person or institution can prevent their content from being cached in the future
5) allow people to opt in, e.g. submit their dying site if they will no longer present their information online by themselves and they agree to allow Ancestry to keep the information alive
6) not require people to give their email address in order to look at someone else’s information. Google manages to do without that and they don’t have paid subscribers.
This would ethically provide a service which attracts people to their site and entices them to become subscribers and at the same time do something that has a chance of being appreciated by the site owners.
Genealogists are often not taken seriously because we don’t cite sources and use other people’s work without giving them credit. Ancestry should not be in the business of encouraging that sort of behavior. Shall beginners be taught that it is ok to grab other people’s data and conclusions and make them seem to be their own? Is that a good future for our beloved pastime?
I for one will be spending my genealogy budget elsewhere unless this is rectified immediately. I am not a disgruntled webmaster nor am I employed by Google. I have no site of my own and have therefore not had my content hijacked. Nevertheless, I think any responsible member of the genealogy community should be deeply angered.