Scraping Online Communities for your Outreach Campaigns

Online communities offer a wealth of intelligence for blog owners and business owners alike.

Exploring the data within popular communities will help you to understand who the major influencers are, what content is popular and who the key content aggregators are within your niche.

This is all well and good to talk about, but is it feasible to manually sort through online communities to find this information? Probably not.

This is where data scraping comes in.

What is Scraping and What Can it do?

I’m not going to go into great detail on what data scraping actually means, but to simplify this, here’s a definition from the Wikipedia page:

“Data scraping is a technique in which a computer program extracts data from human-readable output coming from another program.”

Let me explain this with a little example…

Imagine a huge community full of individuals within your industry. Each person within the community has a personal profile page that contains information about their interests, contact details, social profiles, etc.

If you were tasked with gathering all of this data on all of the individuals then you might start hyperventilating at the thought of all the copy and pasting you’d need to do.

Well, an alternative is to scrape all of this content so that you can automate the process and export the information into a manageable, more consumable format in a matter of seconds. It’d be pretty awesome, right?

Luckily for you, I’m going to show you how to do just that!

The Example of Inbound.org

Recently, I wanted to gather a list of digital marketers that were fairly active on social media and shared a lot of content online within communities. These people were going to be some of the core targets I wanted to get my blog content in front of.

To do this, I first found some active communities online where these types of individuals hang out. Being a digital marketer myself, this process was fairly easy and I chose Inbound.org as my starting place.

Scoping out Data Requirements

Each community is different and you’ll be able to gather varying information within each.

The place to look for this information is within the individual user profile pages. This is usually where contact information and links to social media accounts are displayed.

For this particular exercise, I wanted to gather the following information:

  • Full name
  • Job title
  • Company name and URL
  • Location
  • Personal website URL
  • Twitter URL, handle and follower/following stats
  • Google+ URL, follower count and list of contributor URLs
  • Profile image URL
  • Facebook URL
  • LinkedIn URL

With all of this information I’ll be able to get a huge amount of intelligence about the community members. I’ll also have a list of social media accounts to add and engage with.

On top of this, with all the information on their websites and sites that they write for, I’ll have a wealth of potential link building prospects to work on.

[Screenshot: an Inbound.org user profile]

You’ll see in the above screenshot that a few of the pieces of data are visible on the Inbound.org user profiles. We’ll need to get the other bits of information from the likes of Twitter and Google+, but this will all stem from the scraping of Inbound.org.

Scraping the Data

The idea behind this is that we can set up a template based on one of the user profiles and then automate the data gathering across the rest of the profiles on the site.

This is where you’ll need to install the SEO Tools plugin for Excel (it’s free). If you’ve not used this plugin before, don’t worry – I’ve put together a full tutorial here.

Once you’ve installed the plugin, you’re good to go on the actual scraping side of things…

Quick Note: Don’t worry if you don’t have a good knowledge of coding – you don’t need it. All you’ll need is a very basic understanding of reading some code and some basic Excel skills.

To begin with, you’ll need to do a little Excel admin. Simply add in some column titles based around the data that you’re gathering. For my Inbound.org example, I had ‘Name’, ‘Position’, ‘Company’, ‘Company URL’, etc., which you can see in the screenshot below. You’ll also want to add in a sample profile URL to build the template around.

[Screenshot: spreadsheet column setup]
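If the screenshot doesn't come through, here's roughly what that initial setup looks like. The column titles are from my Inbound.org example; the profile URL is just a placeholder, not a real profile:

A: Profile URL (e.g. http://inbound.org/in/some-profile)
B: Name
C: Position
D: Company
E: Company URL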

Now it’s time to start getting hands on with XPath.

How to Use XPathOnUrl()

This handy little formula is made possible within Excel by the SEO Tools plugin. Now, I’m going to keep this very basic because there are loads of XPath tutorials available online that can go into the very advanced queries that are possible to use.

For this, I’m simply going to show you how to get the data we want and you can have a play around yourself afterwards (you can download the full template at the end of this post).

Here’s an example of an XPath query that gathers the name of the person within the profile that we’re scraping:

=XPathOnUrl(A2, "//*[@id='user-profile']/h2")

A2 is simply referencing the cell that contains the URL that we’re scraping. You’ll see in the screenshot above that this is Jason Acidre’s profile page.

The next part of the formula is the XPath.

What this essentially says is to scrape through the HTML to find a tag that has the id ‘user-profile’ attached to it. This could be a div, span, a or whatever.

Once it’s found this tag, it then needs to look at the first h2 tag within this area and grab the text within it. This is Jason’s name, which you’ll see in the screenshot below of the code:

[Screenshot: the profile page HTML showing the user-profile element]
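In case that screenshot isn't visible, the markup in question looked something along these lines – a simplified reconstruction rather than Inbound.org's exact source code:

<div id="user-profile">
    <h2>Jason Acidre</h2>
    ...
</div>

The //*[@id='user-profile'] part matches the tag carrying that id (here, a div), and the /h2 step then grabs the text of the first h2 inside it.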

Don’t be put off at this stage – you don’t need to go manually trawling through code to build these queries. There’s a much simpler way.

The easiest way to do this is by right-clicking on the element you want to scrape on the webpage (within Chrome); for example, on Inbound.org, this would be the profile name. Now click ‘Inspect element’.

[Screenshot: the ‘Inspect element’ context menu in Chrome]

The developer tools window should now appear at the bottom of your browser (or in a separate window). Within that, you should see the element that you’ve drilled down on.

All you need to do now is right-click on it and press ‘Copy XPath’.

[Screenshot: the ‘Copy XPath’ option in Chrome’s developer tools]

This will now copy the XPath code for your Excel formula to the clipboard. You’ll just need to add in the first part of the query, i.e. =XPathOnUrl(A2,

You can then paste in the copied XPath after this and add a closing bracket.

Note: When you use ‘Copy XPath’ it will wrap some parts of the code in double quotation marks (") which you’ll need to change to single quotes ('). You’ll also need to wrap the whole copied XPath in double quotation marks.
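For example, Chrome will hand you something like the first line below, which you'll need to massage into the second before wrapping it in the formula:

//*[@id="user-profile"]/h2 (as copied from Chrome)
//*[@id='user-profile']/h2 (after switching to single quotes)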

Your finished code will look like this:

=XPathOnUrl(A2, "//*[@id='user-profile']/h2")

You can then apply this formula against any Inbound.org profile and it will automatically grab the user’s full name. Pretty good, right?
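To run this at scale, simply list profile URLs down column A and fill the formula down column B so that each row references its own URL (the URLs here are placeholders, not real profiles):

A2: http://inbound.org/in/profile-one    B2: =XPathOnUrl(A2, "//*[@id='user-profile']/h2")
A3: http://inbound.org/in/profile-two    B3: =XPathOnUrl(A3, "//*[@id='user-profile']/h2")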

Check out the full video tutorial below that I’ve put together that talks you through this whole process:

Want more useful video tutorials? Subscribe to my YouTube channel now!

XPath Examples for Grabbing Other Data

As you’re probably starting to see, this technique could be scaled across any website online. This makes big data much more attainable and gives you the kind of results that an expensive paid tool would offer without any of the cost – bonus!

Here are a few more examples of XPath that you can use in conjunction with the SEO Tools plugin within Excel to get some handy information.

Twitter Follower Count

If you want to grab the number of followers for a Twitter user then you can use the following formula. Simply replace A2 with the Twitter profile URL of the user you want data on. Just a quick word of warning with this one: it looks really long and complicated, but I’ve just used another Excel formula to snip the ‘Followers’ label off the front of the scraped text.

=RIGHT(XPathOnUrl(A2,"//li[@class='ProfileNav-item ProfileNav-item--followers']"),LEN(XPathOnUrl(A2,"//li[@class='ProfileNav-item ProfileNav-item--followers']"))-10)
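To unpack what's happening there: the XPath grabs the whole followers list item, which – assuming Twitter's markup at the time, with the 'Followers' label preceding the count – comes back as something like the first line below. The RIGHT/LEN wrapper then drops the first 10 characters ('Followers' plus a space):

Scraped text:  Followers 12,959  (16 characters)
LEN(...)-10:   16 - 10 = 6
RIGHT(..., 6): 12,959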

Google+ Follower Count

As with the Twitter follower formula, you’ll need to replace A2 with the full Google+ profile URL of the user you want this data for.

=XPathOnUrl(A2,"//span[@class='BOfSxb']")

List of ‘Contributor to’ URLs

I don’t think I need to tell you the value of pulling in a list of websites that someone contributes content to. If you do want to know then check out this post that I wrote.

This formula is a little more complex than the rest. This is because I’m pulling in a list of URLs as opposed to just one entity. This requires me to use the StringJoin function to join all of the outputs together, separated by a comma (or whatever character you’d like).

Also, you may notice that there is an additional argument at the end of the formula, "href". This tells the function to pull in the link within the specific code block instead of the text.

As you’ll see in the full Inbound.org scraper template that I’ve made, this is how I pull in the Twitter, Google+, Facebook and LinkedIn profile links (there’s a sketch of that just after the formula below).

You’ll want to replace A2 with the Google+ profile URL of the person you wish to gather data on.

=StringJoin(", ",XPathOnUrl(A2,"//a[@rel='contributor-to nofollow']","href"))
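The same "href" trick is what pulls in the social profile links I mentioned above. The class names in these two selectors are purely illustrative – you'd find the real ones with 'Inspect element', just as before:

=XPathOnUrl(A2, "//a[@class='twitter-profile-link']", "href")
=XPathOnUrl(A2, "//a[@class='linkedin-profile-link']", "href")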

Twitter Profile Image URL

If you want to get a large version of someone’s Twitter profile image then I’ve got just the thing for you.

Again, you’ll just need to substitute A2 with their Twitter profile URL.

=XPathOnUrl(A2,"//*[@class='profile-picture media-thumbnail js-tooltip']","data-resolved-url-large")

Some Findings from the Data I’ve Gathered

Every big data set comes with some interesting findings. Here are a few little things that I found from the top 100 influential users on Inbound.org.

[Chart: average follower counts for the top 100 users]

The chart above maps out the average number of followers that the top 100 users have on both Twitter (12,959) and Google+ (9,601). As well as this, it shows the average number of users that they follow on Twitter (1,363).

The next thing that I’ve looked at is the job titles of the top 100 users. You can see the most common occurrences of terms within the tag cloud below:

[Tag cloud: job titles]

Finally, I had a look through all of the domains listed within each of the top 100 Inbound.org users’ Google+ ‘contributor to’ sections and mapped out the most frequently mentioned sites.

Here’s the spread of domains that were the most popular to be contributed to:

[Chart: the most frequently contributed-to domains]

It Doesn’t Stop There

As you’ve probably gathered, this can be scaled out across pretty much any community/forum/directory/website online.

With this kind of intelligence in your armoury, you’ll know far more about your targets and can dramatically increase the effectiveness of your outreach campaigns.

Also, as promised, you can download my full Inbound.org scraper template below:

Download the template: http://www.matthewbarby.com/goodies/MatthewBarby-Inbound-Scraper.xlsx

TL;DR

  • Online communities hold valuable data on your target audiences – use it!
  • Scale out your intelligence gathering by brushing up on your XPath.
  • Download my Inbound.org scraper template and let it work its magic.

Extra Reading

Creating Efficient Data Collection Systems for SEO and Social

Data Scraping Guide for SEO & Analytics

Get your Klout score using XPathOnUrl

How To Return Multiple Nodes Using XPathOnUrl

About Matthew Barby

A UK-based digital marketing consultant, Matt oversees digital strategy at Wyatt International. He is a columnist for many different SEO publications, a lecturer for the Digital Marketing Institute and speaks at events across the UK.

40 Responses

Charles Floate

Fantastic Tutorial Matt!
It’s a pain in the ass having to manually do this kind of content scraping and being able to pull tons of info straight into an excel doc can save you a LOT of time.
Nicely done! ^.^

Matthew Barby

Cheers mate. Yeah I think the key here is how easy it is to do – even a novice can go and create a scraping template for a community without having to touch APIs or any code at all.

Charles Floate

Exactly! You don’t have to learn python ;)

Evan Pryce

Sweet post to follow up your equally awesome post on Moz earlier this week. Might be able to do away with outsourcing for this sort of data collection with these tips. Keep the killer content coming.

Btw your Clarity plugin is still saying you’re at Wow, might wanna change that when you get a mo.

Matthew Barby

Glad you liked it, Evan. I’ve been working on some pretty cool techniques using this process to gather intelligence about my outreach targets. I’m sure you’ll find lots of applications once you have a play around.

Oh, and thanks for the tip on my Clarity plugin!

Evan Pryce

Np Matt,

Gonna have a play around and see what I can do with this. Be good to give ctrl+c ctrl+v a break. Love any tips that make the mundane tasks like data gathering quicker and easier.

Henley Wing

Great tutorial, Matthew. There’s definitely a wealth of data you can gather for outreach and competitive analysis.

I’m not sure if XPath can do this, but is there a way to scrape, say, the last 100 links shared by a user in their Twitter timeline? I think that’d be useful for outreach, as you can get an idea of what they’re sharing and what interests them.

Matthew Barby

Hey Henley, good to hear from you on the blog – loving the new features on BuzzSumo btw.

Yes, there is a way to do this – you may want to check out this post I wrote a couple of months ago that shows how to do it and make use of the data: http://findmyblogway.com/outreach-targets/

Brian Dean

This. is. awesome.

Thanks for putting this tutorial together, Matthew.
I’ve never been good at coding so I’m glad you can do this without
having to code anything (except Excel).

Again, great work and I just tweeted it and upvoted it on Inbound.

Matthew Barby

Thanks, Brian. Glad you enjoyed it!

Benjamin Beck

Awesome job Matthew! Really appreciate you going through the steps on video. Really helpful!

Takeshi Young

Great post, I wish directories like this existed in my niche to be scraped!

Henley Wing

What niches do you usually work with?

Takeshi Young

I work in a variety of niches, but my current one is salsa dancing. Or just dance in general, just not a lot of great resources.

Charles Floate

So why not make one then!?

Takeshi Young

I plan to, eventually. Doesn’t solve my scraping problem though!

James

Problem with data scraping like this is it starts getting a bit… shit at any volume. I think there’s a setting you can change in SEO Tools but I’m not sure what it is – it starts freezing up your PC and running out of memory.

You also have difficulty scraping dynamic content loaded via AJAX (although there’s ways). Let’s suppose you want to see who is most often sharing your competitor’s content on Pinterest, i.e. http://www.pinterest.com/source/backpackerbanter.com/

For that, I’d give a shout out to Scrape Similar for Chrome. It works the same way – //div/div/div/div[3]/a/@href for the example above.

But if you’re really going to get serious about scraping, it becomes about PHP, Python or apparently Ruby.

Matthew Barby

Hi James,

I completely agree with you – this isn’t the most effective solution, but what it does do great is negate the need for any advanced coding knowledge.

Running Python scripts to crawl through web entities on a large scale is much more resource efficient and can be directly plugged into a web interface, etc. The only thing is, you’ll need a Python developer, and a good one, to scale this out.

I use these techniques to gather niche datasets and then hook the data up to some more sophisticated tools to get the most out of it.

Amul Patel

killer study.

Scraping/sorting/cleaning datasets for seeding purposes and for outreach is critical for growth.

David Fallarme

Thanks Matt! This is great and can be used in SO many different ways.

I’d love to hear what you do with this data after you have it. How do you make it actionable?

Matthew Barby

Hi David,

Thanks for chipping in – glad you liked the article. One of the ways that I make use of this kind of data is for link prospecting. For example, within the Inbound.org example, I’ve now got a long list of the websites that the top influencers within the Inbound marketing industry write for. That’s a perfect starting point for getting some exposure to my target market.

Alongside this, I’ve got a pool of influencers to start engaging with over social media in order to get my content in front of them and get my content shared widely.

What I don’t use this data for is to just spam people. This defies the whole point of the intelligence gathering exercise.

Jeffsauer

I meant to write a comment earlier, but wanted to say this is a pretty awesome method you have concocted!

Matthew Barby

Thanks, Jeff. There’s tons of ways to apply this as well.

Scott Nissenbaum

I would have to agree on this, since this is what I needed the most.

Matt Antonino

This is great Matt! Can you tell me what the heck I’m doing wrong??

=XPathOnUrl(A1, "//*[@id='row_id_1']/td[6]/table/tbody/tr/td/span")

This *should* work – I don’t know why it’s not… I’m baffled. If you can help, please do! :)

Carpet cleaning Sydney

These are informative tips; I agree with your post.

Herman Rosa

Hey Matt, just wanted to jump in and say man ur doing a great job posting really helpful content. I have to give you credit I have definitely learned a thing or two from your post. Good $#!+ Dude! Keep’em coming.

Matthew Barby

Thanks, Herman. Always appreciate the feedback.

Nick

Some good tips. We use Feedity – http://feedity.com to do most of such extraction automatically. It’s a very handy tool.

Tresnic Media

This is awesome, Matt! Maybe this is a dumb question, but did you manually enter in all the profile URLs or was there something I missed to use xpath to automatically snag the user profile links as well?

Matthew Barby

I actually pulled in the profile URLs by using the Scrape Similar plugin for Chrome. Alternatively, you could grab a list of the URLs with Screaming Frog SEO Spider or Xenu Link Sleuth.

Tresnic Media

Awesome, thanks! This whole process is terrific.

Conor Mulcahy

hey Matthew,

just a quick update for your blog post.

There is an issue pulling down the XPath values on Twitter since their new profile update. The reason being that the SEO Tools XPath checker appears to be treating style classes as divs, and subsequently it cannot identify the item selected. To rectify, I simply amended the first bracketed number from a 1 to a 2 (which basically adds another div).

So this:
=XPathOnUrl(A2,"//*[@id='page-container']/div[1]/div/div[2]/div/div/div[2]/div/div/ul/li[4]/a/span[2]")

becomes:
=XPathOnUrl(A2,"//*[@id='page-container']/div[2]/div/div[2]/div/div/div[2]/div/div/ul/li[4]/a/span[2]")

Matthew Barby

Hey Conor,

That’s great timing as I was looking into this earlier today because of the new Twitter layout – will update this in the post. Send me your website URL and I’ll credit you next to the update :)

Conor Mulcahy

​No worries Matthew,

My website is a little under the weather at the moment & I’ve no time to fix it, so for all intents and purposes it’s out of commission for the time being.

You can use my G+ instead if you wish:
https://plus.google.com/+ConorMulcahy

Conor Mulcahy

Copying the XPath for the website & the location data from Twitter is a little different, as these are located elsewhere on the Twitter profile page.

To solve, change

From this:
=XPathOnUrl(B3,"//*[@id='page-container']/div[4]/div/div/div[1]/div/div/div/div[1]/div[2]/span[2]/a")

To this:
=XPathOnUrl(B3,"//*[@id='page-container']/div[3]/div/div/div[1]/div/div/div/div[1]/div[2]/span[2]/a")

The above change removes a legitimate div, so I’m not sure why it works – it doesn’t make sense to me.

John

I don’t understand – I have done everything like it says in the tutorial, but I am always getting a #VALUE! error.
Any suggestions what am I doing wrong?

Atomic76

Great tutorial! But I am running into problems any time I try to scrape data that’s within an HTML table – I keep getting a “#VALUE!” error in Excel whenever I use the XPath for data that is in a table. Any suggestions?