Tuesday 2 November 2010

Pornography: How much is there on the Internet?

Just the facts, ma'am
 - Sgt. Joe Friday, Dragnet

When I started this little endeavour, I intended to present some pure, unadulterated facts clearly and accurately. I didn't quite grasp the magnitude of the exercise. I have waxed enthusiastic about the Internet, about how all of us are contributing our knowledge to the sum total of the knowledge of the entire planet, and about Google (see my blog Google) and how, in organizing all this information, it lets all of us actually find stuff among those zillion pages. Despite all that, I have discovered that misinformation is still rampant, and we have a long way to go before we eradicate ignorance, if not stupidity.

In my blog Pornography: Statistics Laundering I discuss some of the misinformation, exaggerations and outright self-serving lies which are floating around relating to this topic. I am somewhat outraged to find that many of the doomsday stats presented by what appear to be right wing conservative religious groups seem to be in no way representative of reality. In fact, the gross exaggerations of the situation seem to be designed to scare the electorate into voting for public policies which would greatly curtail our freedoms. Far be it from me to dictate what you do in the privacy of your own home between consenting adults.

A little warning: Readers of this blog will note that I usually write out profanities as f**k. This isn't out of prudishness; this is deliberate on my part because I think it's a bit funnier just in the same way I find Jon Stewart of the Daily Show being bleeped on regular television a bit funnier than actually hearing the word. [laughs] Okay, maybe that does make me prudish? In any case, for the sake of clarity in the lists below, I spelled out the words in full: a break with policy. See my blog I suck; you suck; we all suck. What!?!

The size of the Internet: 24 billion pages
First of all, the following is probably not the be-all and end-all. I am a devoted user of Google, but that doesn't mean that Yahoo or Bing is without merit. Because of time constraints, I have to restrict myself to something, and choosing the mother of all search engines seems like not an incorrect choice.

Oddly enough or maybe not oddly enough, there is quite a bit of opinion on the size of the Internet. Once again I am thwarted in going to a single definitive source of information and must make do with an estimate which I'm sure will be contested.

Size of the Internet: 24 billion web pages
When you know, for example, that the word 'the' is present in 67.61% of all documents within the corpus, you can extrapolate the total size of the engine's index by the document count it reports for 'the'. If Google says that it found 'the' in 14,100,000,000 web pages, an estimated size of the Google's total index would be 23,633,010,000.

This gentleman seems to present a compelling calculation; however, what catches my eye right off the bat is that my query on the word "the" matches what he says in his article, with slight variances from day to day. I'm not 100% sure that his statement is true, namely that the word "the" appears in 67.61% of all web pages, but a cursory test of a couple of web sites seems to yield similar percentages. However, I note that 14.1 billion is 59.66% of the author's 23.6 billion number, not 67.61%. Whatever. For the purposes of the following calculations, I am going to go with 59.66%. Since I just queried "the" and ended up with 14,470,000,000 hits, I will calculate the total pages as being 24,253,167,000.
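
For anyone who wants to check the arithmetic, here is a minimal sketch in Python. It assumes my hit count for "the" was 14.47 billion (the figure that reproduces the 24,253,167,000 total), and the variable names are made up purely for illustration. It also shows what the total would have been had I divided by the stated 67.61% instead: about 21.4 billion.

```python
# Figures quoted from the article on index size
quoted_hits = 14_100_000_000     # hits for "the" in the quoted example
quoted_total = 23_633_010_000    # index size the quoted example arrives at

# Assumption: my own query for "the" returned 14.47 billion hits
my_hits = 14_470_000_000

# Share of pages containing "the", implied by the quoted numbers
implied_share = quoted_hits / quoted_total

# Extrapolate the total index size from my own hit count
my_total = my_hits / implied_share

# For comparison: the same extrapolation using the stated 67.61%
alt_total = my_hits / 0.6761

print(f"implied share of pages containing 'the': {implied_share:.2%}")
print(f"estimated total pages (implied share): {my_total:,.0f}")
print(f"estimated total pages (67.61% share):  {alt_total:,.0f}")
```

Either way the total lands in the low twenty-billions, so the choice of ratio does not change the order of magnitude.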

The amount of pornography on the Internet: less than 1%
I used an Excel spreadsheet to track this information and calculate the percentages. The method is simple: go to Google; type in the word and hit Enter; note down the number of results shown under the Search box; then calculate those results as a percentage of the estimated total pages indexed by Google. Here I am using the number of 24,253,167,000 as the total number of pages. FYI: These numbers change from moment to moment and day to day, so I doubt you will get exactly the same numbers, but they will be similar.

word              # of hits        % of total
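
The spreadsheet calculation is simple enough to sketch in a few lines of Python. This is only an illustration of the method, not the spreadsheet itself: the function name is hypothetical, and the hit counts are the ones reported in the tables further down this post.

```python
# Estimated total pages indexed by Google, as used in this post
TOTAL_PAGES = 24_253_167_000


def percent_of_total(hits, total=TOTAL_PAGES):
    """Return a hit count as a percentage of the estimated total pages."""
    return 100 * hits / total


# Hit counts taken from the tables in this post
results = {
    '"anal sex"': 13_700_000,
    "sex xxx": 69_400_000,
    "sex porn": 77_000_000,
}

for term, hits in results.items():
    print(f"{term:<12} {hits:>12,} {percent_of_total(hits):.2f}%")
```

Swap in your own hit counts to redo the percentages, since Google's numbers shift from day to day.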

So, how do the "dirty" words stack up? I'd say that we are collectively being more literary than not in our communication and not resorting to profanities. You will note that my list is a nod to Mr. Carlin. See my blog Censorship: Kill me but no sex please

word              # of hits        % of total

The following searches target various words which are more than likely related to pornography. However, let's keep in mind that searching for a word only means it appears in the web page; we don't exactly know what the content of the page is. Let's admit right up front that anybody writing a blog, for instance, could be using a word without it technically being related to pornography. Nevertheless, for the purposes of this "estimate", I think it gives a good picture of how these words stand in relation to the whole Internet.

word              # of hits        % of total
"anal sex"        13,700,000       0.06%

In looking back on the word sex, I noted I was getting hits for things like Sex and the City, which wasn't what I was shooting for. So, in an effort to filter out anything which used the word sex but wasn't related to pornography, I added a second word, searching on sex AND XXX, then sex AND porn. As you may or may not know, web developers will add keywords to META tags to increase the accuracy of page indexing, and those keywords will target the material in question. So while the hits for the word sex are all-encompassing, it seems the hits on the word sex relating to pornography are significantly lower.

word(s)           # of hits        % of total
sex xxx           69,400,000       0.29%
sex porn          77,000,000       0.32%

The "quantity" of pornography on the Internet seems to be less than 1% of the total web pages published.

I'm sure that others may object to this rather simplistic methodology, but from the table above, any of the porno keywords are clearly coming in at less than 1%. I find it hard to believe that web sites or web pages offering pornography, whether it be pictures, movies or erotic stories, are not using one or more of the above keywords in some form.

Caveat #1
The above searches all pages, that is, any language. I did experiment with Advanced Search trying to zero in just on English pages but found the results a tad odd. The word sex normally returns around 570,000,000 hits but with Advanced Search in selecting English only, I ended up with 615,000,000. Hmmm, I would have expected a lower number. Beats me why but I don't think this affects the outcome of my unscientific scientific research.

Caveat #2
We can debate the total number of pages; however, the overall results will be the same. The porno words just show up as a low percentage.
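
To see how little the total matters, here is a rough sketch that recomputes the share of the largest keyword count shown above ("sex porn", 77,000,000 hits) against each of the total-page estimates mentioned in this post and in the references at the end. The rounded 27 and 48 billion figures are approximations of those estimates, and the variable names are just for illustration.

```python
# Largest hit count in the keyword tables above ("sex porn")
largest_hits = 77_000_000

# Total-page estimates mentioned in this post (rounded where the post rounds)
estimates = {
    "24 billion (used here)": 24_253_167_000,
    "27 billion": 27_000_000_000,
    "48 billion": 48_000_000_000,
    "1 trillion": 1_000_000_000_000,
}

for label, total in estimates.items():
    share = 100 * largest_hits / total
    print(f"{label:<24} {share:.4f}%")

# Total page count at which 77,000,000 hits would equal exactly 1%
break_even_total = largest_hits * 100
print(f"1% would require a total of only {break_even_total:,} pages")
```

Every one of those estimates keeps the keyword under 1%; the total would have to shrink to 7.7 billion pages before 77,000,000 hits reached 1%.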

My investigative series on a controversial topic
1. Statistics Laundering
2. How much is there?
3. Searching for what?
4. What is it?
5. Does it lead to crime?
6. Defended by... what!?! Feminists?
7. Who buys the most? Conservatives!
8. Does it lead to crime? Part 2
9. Is it an addiction? (to come)
10. My conclusions (to come)


Pornography: My investigative series

The Straight Dope
How much of all Internet traffic is pornography? - October 7, 2005
This gentleman gave me some of the ideas for doing my investigation. While his numbers are five years old, he concluded his study about the amount of pornography on the Internet by stating, "I'd say we've got an unremarkable list of life's little pleasures, whether online or off."

Indexable or number of web pages on Internet
This gentleman presents 3 estimates. As I said, method #2 is confirmable.
  1. Supposedly from Google dated 2008: 1 trillion
  2. The method I used above: here around 27 billion; I used 24.
  3. His own method combining 1 and 2 to arrive at 48 billion.

Snopes: Just the facts, ma'am.
According to our urban legends debunkers, this particular phrase was never uttered by Sergeant Joe Friday. Apparently he said, "All we want are the facts, ma'am," but over the years, with the line being repeated and accidentally modified by so many people, the popular version has become a staple of folklore.

