Scraping the forums

I’m planning on scraping the forums. It’ll help me sharpen my skills, plus i’d finally gather the data i need to develop my fingerprinting idea. :slight_smile:

So, i’m curious.

  • Is there actually a chance i could hug the forum to death if i manage to fully utilise my 30 Mbit/s connection?

  • Would it be more advisable to use several IPs to evade any false “DDoS prevention” positives?

  • How do Google, Bing, DuckDuckGo, etc. do first time indexing of a site?

I’ve never actually done this before. I thought i’d ask instead of coming up with solutions to help me avoid potential problems. :slight_smile:

Thank you for any information.

If you are going to scrape, make sure you use the JSON feeds rather than actually scraping. For example

Why?

For one, it’ll actually be readable, rather than you having to work out how to pull stuff out and work around the fact that the entire forum is very heavily JavaScript-based.

That’s the point. I’d not sharpen my skills doing it the easy (and boring) way, would i? :slight_smile:

What about the questions in the initial post?

Many moons ago I worked for a company which did website scraping. To answer your question about how Google, etc. do it: if memory serves correctly, it’s something like:

  1. Acquire URL
  2. Look for the robots.txt file
  3. Initialize the crawl (loop)
    3a) While searching through the HTML, look for any “A” tags and copy each href into a temp array.
    3b) Compare the temp array entries to the rules in robots.txt: if an entry matches a disallowed path, remove it from the array; if not, proceed to crawl it.

Basically you’re just keeping a list of URLs in an array, crawling to them and storing the HTML to a file. Depending on your goals, you may or may not want to do some more sophisticated things to preserve layout, etc.
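
The steps above can be sketched roughly like this in Python. It is only a minimal sketch, not how any search engine actually does it. The names `crawl`, `LinkExtractor`, the caller-supplied `fetch` callback, and the `example-bot` user agent are all made up for illustration. Staying on one host and capping the page count are assumptions of mine too. The robots.txt check uses the standard library's `urllib.robotparser`.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.robotparser import RobotFileParser


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag in a page (step 3a)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def crawl(start_url, fetch, robots_txt, user_agent="example-bot", max_pages=100):
    """Breadth-first crawl of a single host; returns {url: html}.

    fetch(url) -> str is supplied by the caller (hypothetical helper).
    robots_txt is the already-downloaded text of the site's robots.txt.
    """
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())  # step 2
    if not rules.can_fetch(user_agent, start_url):
        return {}

    host = urlparse(start_url).netloc
    queue = deque([start_url])  # URLs still to visit
    seen = {start_url}          # URLs already queued, to avoid loops
    pages = {}

    while queue and len(pages) < max_pages:  # step 3
        url = queue.popleft()
        html = fetch(url)
        pages[url] = html

        extractor = LinkExtractor()
        extractor.feed(html)
        for href in extractor.hrefs:
            # Resolve relative links and drop any #fragment part.
            link, _fragment = urldefrag(urljoin(url, href))
            if urlparse(link).netloc != host:
                continue  # stay on the starting host
            if link in seen or not rules.can_fetch(user_agent, link):
                continue  # step 3b: skip duplicates and disallowed URLs
            seen.add(link)
            queue.append(link)
    return pages
```

Passing `fetch` in as a callback keeps the sketch independent of any particular HTTP library. A real crawler would also want a delay between requests so it doesn't hammer the server.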

As Steve said, this forum is very JavaScript-heavy, which will introduce a few other challenges for you. JSON is far easier to parse and translates easily into objects, which can then be processed quickly and efficiently.
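
To illustrate the "JSON to objects" point, here is a small sketch using only the Python standard library. The payload shape (`posts`, `id`, `author`, `body`) and the `Post` class are invented for the example. The forum's real feed layout may well differ.

```python
import json
from dataclasses import dataclass


@dataclass
class Post:
    id: int
    author: str
    body: str


# Hypothetical feed payload; the real forum's JSON layout may differ.
SAMPLE_FEED = """
{"posts": [
  {"id": 1, "author": "alice", "body": "Scraping the forums"},
  {"id": 2, "author": "bob", "body": "Use the JSON feeds"}
]}
"""


def parse_posts(text):
    """Turn a JSON document into a list of Post objects."""
    data = json.loads(text)
    return [Post(p["id"], p["author"], p["body"]) for p in data["posts"]]


posts = parse_posts(SAMPLE_FEED)
```

Compared with the HTML route, there are no tags to walk and no script-rendered content to wait for. One `json.loads` call gives you plain dictionaries and lists.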


There aren’t actual ‘pages’ with the forum content to scrape. It loads as you go. Now, you can do this all with a headless browser, but that’d be a total pita to do.


I know, but i don’t know a better term, so i call them pages. :slight_smile:

Thank you both. I will look at the feed thing as well, but first i’m looking forward to cognitively torturing myself. :smiley:
