Just as the title suggests, today I will show you how to scrape some basic user data from blackhatworld.com.
You may be wondering why? Well, because I can, but also because some people may be able to benefit from the scraped info. For example, with the user data we will scrape today, you can easily see who the most active users are. You can then pick out the active users without a premium membership and offer to buy them one. The reason for this is simple: in return, the user surrenders their comment signature and thus gives your ads more exposure. In the world of SEO, this is actually a very cheap way to advertise.
Now let’s get started. For this tutorial you will need a registered copy of Scrapebox with the Link Extractor plugin and the Premium Article Scraper addon installed.
First, we need to see how many registered users are online right now. Just go to https://www.blackhatworld.com/online/?type=member and scroll down to see how many pages of results there are. There will usually be anywhere between 35 and 50 pages of results.
Now we need to generate the result page URLs. With the exception of the first page, the rest are sequential.
You can simply copy and paste the URLs above into Excel or a Google Sheet and drag the corner of the last cell down to generate as many result pages as you require.
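If you’d rather script this than drag spreadsheet cells, the same sequence can be built in a few lines of Python. Note that the `page=` query parameter below is an assumption based on typical forum pagination; confirm the exact pattern by clicking through to page 2 in your browser first.

```python
# Hypothetical sketch: build the sequential result-page URLs.
# The "page=N" parameter is an assumption -- verify the real
# pattern in your browser before relying on it.
BASE = "https://www.blackhatworld.com/online/?type=member"

def page_urls(last_page):
    # The first page has no page parameter; pages 2..N are sequential.
    return [BASE] + [f"{BASE}&page={n}" for n in range(2, last_page + 1)]

for url in page_urls(40):
    print(url)
```

Paste the printed list straight into the Scrapebox harvester.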
Paste your generated URLs into the Scrapebox harvester and open the Link Extractor plugin. Before you start scraping, add a filter to remove any URLs not containing /members/, and set your number of connections to 1 with a delay of 5 seconds. If you don’t do this, you will be blocked from the site.
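The /members/ filter is nothing more exotic than a substring check; here is a minimal sketch of the same rule, in case you ever need to re-filter a URL list outside of Scrapebox:

```python
# Keep only member-profile links; drop everything else the
# Link Extractor pulls off the page.
def members_only(urls):
    return [u for u in urls if "/members/" in u]
```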
We will now use the Premium Article Scraper to extract the user details. First, you will need to add the following configuration file to your Scrapebox folder: download this zip file and extract it to ~\scrapebox64\Plugins\ArticleScraper\Definitions
Now open the plugin and there should be a BHW Members entry in the configurations. Select it and load the URLs from the Scrapebox harvester. You will need to adjust the number of connections again or you will be blocked: open the options tab on the bottom right-hand side of the plugin, navigate to the “worker threads” setting, and set it to 1. If you have private proxies, then you can skip that last step.
You can now hit start and wait for the scraping to finish. Once it has completed, export the results to a new folder dedicated to BHW user data.
Also, you will need to make sure the following export settings are made.
- Select Both Fields
- Use field value as file name
- Use | for the separator.
- And overwrite the file name if it exists. Useful for updating current users.
What we need to do now is merge the scraped user data files together. This is very easy to do in Windows. Sorry, I’m not a Mac guy. Copy the folder location to your clipboard, then open the Windows command prompt (type CMD in the Windows search bar to get to it quickly). In the terminal, type cd [paste your folder path here, without the brackets] and hit enter. Now type the following command:
copy *.txt bhwmembers.txt
You should now see a file named bhwmembers.txt
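For anyone not on Windows, or anyone who wants the merge scripted, here is a rough cross-platform equivalent of the `copy *.txt` step. The folder name is just an example; point it at wherever you exported the user data.

```python
from pathlib import Path

def merge_txt(folder, out_name="bhwmembers.txt"):
    # Concatenate every exported .txt file in the folder into one,
    # skipping the output file itself so reruns don't double it up.
    folder = Path(folder)
    out = folder / out_name
    parts = [f.read_text(encoding="utf-8")
             for f in sorted(folder.glob("*.txt")) if f != out]
    out.write_text("\n".join(parts), encoding="utf-8")
    return out
```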
Now comes the tricky part: we need to format the new file so that it can be imported into a spreadsheet. Before we start, though, you will need a portable application called “Find and Replace”; it’s a free program that’s very useful for filtering scrapes.
To filter out the mess from our harvested user data, you will need to run the application and replace the following text with |
You might notice that there are a lot of unwanted spaces, don’t stress, these will automatically be removed with the last step.
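If you prefer to script this step instead of using the Find and Replace app, the replace-then-strip logic looks like the sketch below. The label strings are placeholders of my own invention; substitute whatever field labels actually appear in your scraped file.

```python
# Hypothetical field labels -- swap in the real ones from your file.
LABELS = ("Messages:", "Reaction score:")

def to_pipe_delimited(line, labels=LABELS):
    # Replace each field label with the | separator...
    for label in labels:
        line = line.replace(label, "|")
    # ...then strip the leftover spaces around each field.
    return "|".join(part.strip() for part in line.split("|"))
```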
All that’s left to do now is to import the txt file into Google Sheets.
Just log in to https://docs.google.com/spreadsheets/u/0/, create a new sheet, and import the file from the “File” menu on the top left-hand side.
Before you hit import, make sure you set the “separator type” to |
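If you want to sanity-check the merged file locally before uploading it, Python’s csv module will read pipe-delimited data directly; the filename below is the one produced by the merge step.

```python
import csv

def load_members(path="bhwmembers.txt"):
    # Read the pipe-delimited export into a list of rows,
    # dropping any blank lines along the way.
    with open(path, newline="", encoding="utf-8") as f:
        return [row for row in csv.reader(f, delimiter="|") if row]
```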
Congratulations! You now know how to scrape blackhatworld.com user data.
NOTE: I have since discovered that some BHW users have the | symbol in their user profile. This will obviously cause issues for you when sorting. To solve it, just use a different character for the separator.