Discussion of Online Advertising, CPA, SEO, Affiliate and Next Generation Marketing
ReveNews Online Revenue News & Opinions Since 1998

SEO Matrices - Markov Chaining and Term Vector Models

January 5th, 2008 by Heather Paulson

2008 is upon us, and if I hear “Happy New Year” one more time I am going to scream! To take my mind off the trauma that was New Year’s Eve, I thought I would rant a bit about SEO. (Yes, apparently I am going to rant now. Yeah!)

I was once asked by a company I worked for to write down my “list” of what it takes to provide quality SEO - my best-practice notes, if you will. I realized how impossible it was to try to explain Markov Chaining and PageRank factors, latent semantic indexing (LSI), and the organization it takes to weight keyword density percentages across multiple points of bot scrutiny, both in the on-page content and in the incoming anchor text. It is an extensive list of requirements, and each SEO professional has developed their own best-practice items for approaching a natural search engine optimization project. I have developed a content matrix that I use to apply keyword density patterns that follow the Term Vector Model formulas and Markov Chaining principles, and that include LSI patterns of keyword spread on a web page, for heightened PR and keyword placement. Each site and each vertical is different… Pattern, density, application… Identify the pattern and you can beat the algorithms.

Markov Chaining and Term Vector Models

Google’s PageRank algorithm uses Markov Chaining to help calculate PR scoring.
Wikipedia describes the Markov chain behind PageRank as follows - “The PageRank of a webpage as used by Google is defined by a Markov chain. It is the probability to be at page i in the stationary distribution on the following Markov chain on all (known) webpages. If N is the number of known webpages, and a page i has k_i links then it has transition probability (1-q)/k_i + q/N for all pages that are linked to and q/N for all pages that are not linked to. The parameter q is taken to be about 0.15.”
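As a rough illustration of the transition probabilities quoted above, here is a minimal Python sketch that steps the chain repeatedly on a tiny made-up site until the ranks settle. The page names, the `pagerank` function, the iteration count, and the way pages with no outgoing links are handled are all assumptions for illustration; this is not Google's actual implementation.

```python
def pagerank(links, q=0.15, iterations=100):
    """Approximate the stationary distribution of the quoted Markov chain.

    links maps each page to the list of pages it links to.
    """
    pages = sorted(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}  # start from a uniform guess

    for _ in range(iterations):
        new_rank = {page: 0.0 for page in pages}
        for page, outlinks in links.items():
            if outlinks:
                # Every page receives q/N from this page...
                for target in pages:
                    new_rank[target] += rank[page] * q / n
                # ...and each linked page receives an extra (1-q)/k_i.
                for target in outlinks:
                    new_rank[target] += rank[page] * (1 - q) / len(outlinks)
            else:
                # Assumption: a page with no links jumps to any page uniformly.
                for target in pages:
                    new_rank[target] += rank[page] / n
        rank = new_rank
    return rank


# Hypothetical three-page site: both inner pages link back to the home page.
toy_web = {
    "home": ["products", "contact"],
    "products": ["home"],
    "contact": ["home"],
}
ranks = pagerank(toy_web)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy example the home page ends up with the largest share because every other page links to it, and the ranks always sum to 1 because they form a probability distribution.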

Term Vector Modeling is an algebraic model for representing text documents. I am not going to get into all the details of TVM or vector space modeling, but I will say this: when you combine the principles of the Markov chain behind PageRank with the Term Vector Modeling principle of term weighting, both within your linking chain and within your web pages’ content, you get some pretty great results for SEO.
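To make the term vector idea a little more concrete, here is a minimal sketch, assuming raw term counts as the weights and cosine similarity as the comparison (real engines use more elaborate weighting). The function names and the sample texts are hypothetical.

```python
import math
from collections import Counter


def term_vector(text):
    """Represent a text as a vector of raw term counts."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two term vectors (0.0 if either is empty)."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


query = term_vector("custom footwear")
relevant_page = term_vector("custom footwear and custom boots")
unrelated_page = term_vector("contact us by phone")

relevant_score = cosine_similarity(query, relevant_page)
unrelated_score = cosine_similarity(query, unrelated_page)
print(round(relevant_score, 3), unrelated_score)
```

A page that shares weighted terms with the query scores closer to 1, while a page with no terms in common scores 0.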

Markov Chaining points to consider for optimal page rank

1) Anchor text of your incoming links. (Use the keywords you actually want to acquire search placement for in the anchor text. The page the link points to should have a high keyword density count for those keywords at all points of bot scrutiny.)

Example: [image: markov_paulson.jpg]

2) Ensure keywords are applied within your H1, H2, etc. and your meta content. (Yes, this means adding keywords to your description tag as well.) The density of keywords in your incoming links’ anchor text and in the content of your own site should be balanced by a density percentage that you must pre-configure.

3) The anchor text and path nomenclature of your internal link structure should “match”, for heightened relevancy of the “term” and to increase the likelihood of search placement for that term. (Do they match? Example: the anchor text is “Custom Footwear”, the path nomenclature is “http://www.yoursite.com/custom_footwear.php”, the meta title is “Custom Footwear”, the H1 on the page is “Custom Footwear”, and - and - and.)

4) Relevance is based on the density of semantic and LSI (latent semantic) words on your page. It’s not how MUCH content you have (and content density patterns change per vertical); it’s the keyword density, KD = tf_i / L_i, where tf_i = the number of times a term i occurs in a document and L_i = the total number of terms in the document. The keyword density of a 600-word document that repeats the term “footwear” 6 times is KD = 6/600 = 0.01, or 1%. It is also the WHERE: which region on the page or in the code the term is applied. Google also weights keywords from your pages’ image nomenclature, folder nomenclature, video, document files, etc. (What are you naming your images? - DOH)

Tip on keyword page weighting: Some regions on the web page are given a higher relevancy uptick, or relevancy factor, for keyword weighting than others. Remember back in the early 2000s (pre Google’s Florida storm) how we killed it with a HUGE H1 tag above the first table? NOW each vertical has different relevancy regions; look for them in your vertical.

5) It is not how MANY links are coming in; it is the QUALITY of the links (the amount of PR juice you get from them - PR juice depends on the linking site’s or page’s own factors). Spamming link techniques do not work.

6) One of my secrets for generating incoming links is finding a high-PR site optimized for the “base root” term for which I want a higher quality or relevancy factor across my entire site’s content. So if the weighting or relevance needs to be higher for the term “custom” than for the term “footwear”, I gain incoming links to weight relevance toward the “term” that has the largest spread across additional keyword terms on my site. In essence, I want to achieve a heightened Markov chaining parameter from a high-PR site that is optimized for the root term I have weighted heavily, using term vector weighting, on my website’s pages.
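The keyword density arithmetic in point 4 above can be sketched in a few lines of Python. This is a minimal sketch, assuming words are split on anything that is not a letter, digit, or apostrophe; the `keyword_density` function and the filler document are hypothetical, and a real audit would also need to account for where on the page each term appears.

```python
import re


def keyword_density(text, term):
    """Return (occurrences of term) / (total words), or 0.0 for empty text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if not words:
        return 0.0
    return words.count(term.lower()) / len(words)


# Rebuild the example from point 4: a 600-word document that
# repeats the term "footwear" 6 times.
document = " ".join(["footwear"] * 6 + ["filler"] * 594)
density = keyword_density(document, "footwear")
print(f"KD = {density:.2%}")
```

The same function can be run on a competitor's page copy to compare density percentages before deciding on a target for your own pages.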

A few of my suggestions for increasing your site’s search engine placement and/or PR

1) Use absolute links on all internal links of your website.

2) Use an organized code and on-page content matrix to organize the density or spread of your keywords across your website’s pages. (I cluster 5 root terms per page, with LSI thrown in, to achieve multiple keyword placement in the engines. Your percentage of keyword weighting depends on your document’s word count, and the formula should be determined by competitor evaluation. Keyword spread, “where you put the keywords”, depends on the regions of bot scrutiny for your vertical.)

3) Use the nofollow attribute (rel="nofollow") on internal links to pages you want to stop throwing PR juice away on. (C’mon, do you REALLY want your “contact us” page to take PR juice away from your home page?)

4) Google’s universal search has added social media to the Markov Chain mix: blogs, video, and press all hold PR weighting and should be considered. (Is your blog’s RSS feed in Technorati? Or bumpzee?) Start being more social!

5) Per-page text clustering is an iterative process. Developing a site content matrix can help you apply multi-page agglomerative clustering techniques to achieve multi-page placement in natural search for thousands of keywords. Hire an SEO content specialist to develop your website content, press, blog posts, etc.

6) Garden walls - Ensure all of the pages in your site are linked to, internally or externally. If you have pages that are not linked to from any source, remove the garden wall around these pages by creating a Google sitemap and actually submitting it to Google. (Google Webmaster Tools can help.)
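For the sitemap in point 6, a minimal sketch of generating one with Python's standard library might look like the following. It assumes the sitemaps.org 0.9 XML format with only the required `loc` element per URL; the `build_sitemap` function is hypothetical, and the URLs reuse the yoursite.com placeholder from earlier in this post.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a sitemap XML string listing each URL in a <loc> element."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


# Placeholder URLs: include the pages nothing else links to.
sitemap_xml = build_sitemap([
    "http://www.yoursite.com/",
    "http://www.yoursite.com/custom_footwear.php",
])
print(sitemap_xml)
```

The resulting string would typically be saved as sitemap.xml at the site root and then submitted through Google Webmaster Tools.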

Each site has its own keyword density or “weighting” formula. This formula is based on many factors that a search engine optimization professional should be able to identify and manage. Your content matrix needs to be refreshed monthly to trigger natural keyword search placement for specific pages on your site, or to help you outmaneuver your competitors’ placement. These factors are also influenced by the code your website was written in. (Certain code allows or disallows bot spidering: redirects, error pages... there are many things to consider. Do you need a .php explode script? Or mod rewrite? Are you running on .asp?)

Individual verticals run off of different algorithms in Google. Adjustments per vertical should be based on a competitive evaluation of the sites that ARE in the top positions, compared with your own goals for page placement or PageRank. Pull backlink reports and placement reports for your competitors to determine how they achieved search placement. Companies that are interested in the “best of breed” competitive analysis tool for SEO need to have Syntryx.com.

So how well does all this work?
Well, I launched my company website a couple of months ago and was able to achieve top placement for most of my terms within 8 days. (THROW out the theory that older domains place better, please!) That includes a nice placement for the term “Search Management” in Google: out of 135,000,000 other websites, I am in the top 15! Yup, it works.

You do not have to understand Term Vector Model formulas to do SEO. You may, like me, simply have a knack for identifying patterns. It’s all in the code; just remember to leave a good crumb trail for the ants and some sugar for the spiders!

Happy SEO’ing and sigh, Yes! - Happy New Year!

14 Comments

Nice article, Heather. It made even my head spin, despite the fact that I consider myself pretty good at math, hehe.

It would be cool if you could elaborate a bit on the keyword spread and count versus the old metrics called “Keyword Density”, which Google never used as far as I heard. Everybody agrees that Keyword Density, as in how often the term appears in relationship to the number of words of content, is irrelevant since the times of Infoseek. You are talking about a different density, right?

Regarding the domain age: I think the biggest problem for a new domain is overcoming the so-called “Google Sandbox Effect” (“there is no Spoon”, er, Sandbox; it only looks like one, hehe). Depending on factors such as the quantity and quality of inbound links, the sandbox effect usually lasts from three up to six months. Once the domain is out of the sandbox (or no longer affected by the sandbox effect), I also believe that the impact on ranking becomes smaller and smaller. Usually the side effects of domain age (= more inbound links) contribute to the fact that older domains appear to outrank younger ones.

Look at Wikipedia.org, for example: the domain is not that old (created 2001-01-13), yet it outranks virtually everything, unless Google applies special penalties to certain Wikipedia pages to knock them down. Wikipedia does not outrank everybody because of the age of its domain, but because of the PageRank power (PR9) it has accumulated and does not “leak” very much, due to the fact that most outbound links are “nofollow” (different subject).

What do you think?

Hello Carsten,

I have never had a site we have developed (New site or Old) get into the Google sandbox.. Sorry never..

Non-monetized and non-commercialized online entities such as .edu’s, wikis, and blogs rank higher and faster, as they are low on the spam control spectrum for most algos.

Every website has its own potential to rank and place - the formula for “weighting” is in my post, under number 4).

Hey Carsten,

Well I had a nice response however the post never showed up.. Will wait and see if it does.. If not I will post my response again.

Your comment was waiting for approval. I approved it and also removed the duplicate.

Anyway…

“I have never had a site we have developed (New site or Old) get into the Google sandbox.. Sorry never..”

I find that hard to believe, to be honest. This does not sound like what pretty much all SEOs experience, and what I have seen on some of my sites as well. Even Google admits that there is something “that could be described” as a “Sandbox Effect” for new sites.

You can make a believer out of me. I happened to register by coincidence a domain just a few days ago, which is not commercial and virgin (new registration, no drop etc.).

It’s immigrationresources.info and I want to move the 11 pages with immigration resources from my family website at cumbrowski.de to that new domain.

Let’s get it to rank somewhere on the first four SERP pages for the phrase “immigration resources” before 3 months are over. It shows an estimated 3.5-3.6 million results for the phrase. There is no governmental site in the top 10 results for this phrase.

Intrigued? You can use this site as case study and blog about it or whatever you want. You can also put your own affiliate links on there, if you want to (there are two little ads on the right side).

Guten Morgen Carsten!

Hey, thanks! Actually, I have been working on cases and testing for years now and am hoping to get the info out of my head and into a book!

Have a great Sunday!

Jim Kukral said:

I wonder how many more people would have read this if the headline was “I’ve Cracked Google’s Algorithm” :)

Pat Grady said:

Wonderful article Heather!

Your point #3 above is such a simple thing, one I had been overlooking until late 2007. Analyzing my site’s navigation, internal linking scheme structures and inbound links has helped me to use nofollow and other things to better sculpt the PR flow that I want.

Hello Pat,

Thanks! For others seeking additional information, there is a fantastic article concerning use of no follow here - http://www.stonetemple.com/blog/?p=187

have a great day!

Tom Schmitz said:

Great article Heather.

When you say “Identify the pattern and you can beat the algorithms” it sounds like you place the majority emphasis on on-site SEO. Is this correct? I tell people that the more competitive the keywords the more important off-page factors become and the more likely external links will make the ranking difference. Care to give a ratio or your own rule of thumb?

Also, often when I conduct pattern analysis I discover that the top ranking websites across vertical markets turn out to have lots of site-wide paid text link advertisements pointing to them, often from multiple domain networks. (Can everyone say “In the news”?) How do you identify optimal on-page patterns when off-page factors dominate the SERPs?

Hi Tom,

You hit the nail on the head with your questions…;)

“When you say “Identify the pattern and you can beat the algorithms” it sounds like you place the majority emphasis on on-site SEO. Is this correct?”

YES - I do place the majority of my strategic emphasis on on-page content organization. However, some sites might need more emphasis on off-page factors; it depends on the site and the competitive environment.

“I tell people that the more competitive the keywords the more important off-page factors become and the more likely external links will make the ranking difference. Care to give a ratio or your own rule of thumb?”

YES - I think you are right about off-page factors making the difference with highly competitive terms. An incoming-links-to-content ratio? I have not determined a rule of thumb at all; this appears to fluctuate with the many factors involved.

“Also, often when I conduct pattern analysis I discover that the top ranking websites across vertical markets turn out to have lots of site-wide paid text link advertisements pointing to them, often from multiple domain networks. (Can everyone say ‘In the news’?) How do you identify optimal on-page patterns when off-page factors dominate the SERPs?”

Wow, great question, and a tricky one. There are so many site-wide factors involved. If no optimal term weighting case exists (in the competitive environment, across multiple verticals with the same site conditions) relative to the incoming link volume of the site we are optimizing, then I would slowly optimize and organize the site’s content region by region to achieve optimal exposure and placement, and hope to find a balance that makes the placement stick.

Tony Cohn said:

Great piece Heather. It is nice to see an SEO pro giving away tips for a change instead of just making a top ten list.

Heather, great post! Really great content. You managed to cover LSI, Markov Chaining, Information Retrieval, and Term Vector Modeling without speaking another language. Stumbled and Delicioused!

tracker said:

Hi,
I don’t agree with the aging factor, as I know one website that is 3 months old with a surprising PageRank of PR=3. Can you believe it? A few months back it was PR0, and now it’s PR=3. Can anyone look at it and see why it suddenly has PR=3?
Website is dailyrics(dot)com

Thanks,
tracker

Hello Tracker,

Yup I see you are sporting a PR 3 - Was YatanWeb.com previously parked at this IP? I see a Google cache on Jan 9th from another url that is a PR4?
