Scraping the forums

yellow_parasol · November 7, 2017, 12:34pm

I’m planning on scraping the forums. It’ll help me sharpen my skills, plus i’d finally gather the data i need to develop my fingerprinting idea.

So, i’m curious.

Is there actually a chance i could hug the forum to death if i manage to fully utilise my 30 Mbit connection?
Would it be more advisable to use several IPs to evade any false “DDoS prevention” positives?
How do Google, Bing, DuckDuckGo, etc. do first-time indexing of a site?

I’ve never actually done this before. I thought i’d ask first instead of coming up with my own solutions to avoid potential problems.

Thank you for any information.

Steve_Ronuken · November 7, 2017, 12:44pm

If you are going to scrape, make sure you use the json feeds, rather than actually scraping.

https://forums.eveonline.com/t/23275.json?track_visit=true&forceLoad=true&_=1510058592134 for example
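A rough sketch of what working with such a feed could look like in Python. The field names `post_stream`, `posts`, `post_number`, `username` and `cooked` are assumptions about the feed's layout, and `topic_json_url` assumes the feed also answers without the query parameters in the URL above; check a real response before relying on either. `Post`, `topic_json_url` and `parse_topic` are made-up names for illustration.

```python
import json
from dataclasses import dataclass


@dataclass
class Post:
    number: int
    username: str
    html: str


def topic_json_url(topic_id, base="https://forums.eveonline.com"):
    # Assumption: the query parameters from the example URL are optional.
    return f"{base}/t/{topic_id}.json"


def parse_topic(raw):
    """Turn the raw JSON text of one topic into a list of Post objects."""
    data = json.loads(raw)
    # Assumed layout: {"post_stream": {"posts": [{...}, ...]}}
    posts = data.get("post_stream", {}).get("posts", [])
    return [
        Post(p.get("post_number"), p.get("username"), p.get("cooked", ""))
        for p in posts
    ]
```

The raw text can come from any HTTP client; keeping the parsing separate from the download makes it easy to test against a saved response.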

yellow_parasol · November 7, 2017, 12:44pm

Why?

Steve_Ronuken · November 7, 2017, 12:45pm

For one, it’ll actually be readable, rather than you having to work out how to pull stuff out and work around the fact that the entire forum is very heavily JavaScript based.

yellow_parasol · November 7, 2017, 12:46pm

That’s the point. I’d not sharpen my skills doing it the easy (and boring) way, would i?

What about the questions in the initial post?

Cwittofur_Cesaille · November 7, 2017, 1:31pm

Many moons ago I worked for a company which did website scraping. To answer your question about how Google etc. do it: if memory serves correctly, it’s something like this:

1) Acquire the URL.
2) Look for the robots.txt file.
3) Initialize the crawl (loop):
3a) While searching through the HTML, look for any “A” tags and copy the href of each into a temp array.
3b) Compare the temp array entries to the rules in the robots.txt file; if an entry matches, remove it from the array; if not, proceed with it.

Basically you’re just keeping a list of URLs in an array, crawling to them and storing the HTML to a file. Depending on your goals, you may or may not want to do some more sophisticated things to preserve layout, etc.
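The steps above could be sketched roughly like this in Python, using only the standard library. This is a minimal illustration and not how any search engine actually does it: the `crawl` function, the `LinkCollector` class, the `example-bot` user agent, the same-host restriction and the page limit are all assumptions added for the example, and the caller supplies `fetch` and the robots.txt text.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.robotparser import RobotFileParser


class LinkCollector(HTMLParser):
    """Step 3a: collect the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def crawl(start_url, fetch, robots_txt, user_agent="example-bot", limit=100):
    """Breadth-first crawl; fetch(url) must return the page HTML as a string."""
    robots = RobotFileParser()
    robots.parse(robots_txt.splitlines())        # step 2: load the rules
    host = urlparse(start_url).netloc            # assumption: stay on one host

    pages = {}
    if not robots.can_fetch(user_agent, start_url):
        return pages
    queue = deque([start_url])                   # step 1: the list of URLs
    seen = {start_url}

    while queue and len(pages) < limit:          # step 3: the crawl loop
        url = queue.popleft()
        html = fetch(url)
        pages[url] = html                        # "store the HTML"
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.hrefs:             # step 3a: the temp array
            link = urldefrag(urljoin(url, href)).url
            if link in seen or urlparse(link).netloc != host:
                continue
            if robots.can_fetch(user_agent, link):   # step 3b: robots filter
                seen.add(link)
                queue.append(link)
    return pages
```

Passing `fetch` in as a parameter keeps the sketch testable offline; for a real run it could wrap `urllib.request.urlopen` and sleep between requests so the server is not hammered.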

As Steve said, the forum is very JavaScript heavy, which will introduce a few other challenges for you. JSON is far easier to parse and translates easily into objects that you can then process quickly and efficiently.

Steve_Ronuken · November 7, 2017, 1:40pm

There aren’t actual ‘pages’ with the forum content to scrape. It loads as you go. Now, you can do this all with a headless browser, but that’d be a total PITA to do.

yellow_parasol · November 8, 2017, 12:58pm

I know, but i don’t know a better term, so i call them pages.

Thank you both. I will look at the feed thing as well, but first i’m looking forward to cognitively torturing myself.

system1 · February 6, 2018, 12:58pm

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.
