For one, it’ll actually be readable, rather than you having to work out how to pull stuff out and work around the fact that the entire forum is very heavily JavaScript based.
Many moons ago I worked for a company which did website scraping. To answer your question about how Google, etc. does it: if memory serves correctly, it’s something like:
1) Acquire a URL.
2) Look for the robots.txt file.
3) Initialize the crawl (loop):
3a) While searching through the HTML, look for any “A” tags and copy the href out into a temp array.
3b) Compare the temp array entries to the robots.txt rules; if an entry matches a disallowed path, remove it from the array; if not, you proceed.
Basically you’re just keeping a list of URLs in an array, crawling them, and storing the HTML to a file. Depending on your goals you may or may not want to do more sophisticated things to preserve layout, etc.
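To make that concrete, here’s a rough sketch of that loop in Python, using only the standard library. The start URL and user-agent string are placeholders; swap in whatever you’re actually crawling:

```python
# Minimal crawl loop: fetch robots.txt, then repeatedly pull a URL from
# the queue, fetch it, extract <a href> links, and enqueue the new ones.
import urllib.request
import urllib.robotparser
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

START_URL = "https://example.com/"  # hypothetical starting point
USER_AGENT = "my-toy-crawler"       # identify yourself; blank agents often get blocked

class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag (step 3a above)."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, max_pages=10):
    # Step 2: fetch and parse robots.txt once up front.
    robots = urllib.robotparser.RobotFileParser()
    robots.set_url(urljoin(start_url, "/robots.txt"))
    robots.read()

    queue = [start_url]   # the "array of URLs" from the description
    seen = {start_url}
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.pop(0)
        # Step 3b: skip anything robots.txt disallows.
        if not robots.can_fetch(USER_AGENT, url):
            continue
        req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
        try:
            with urllib.request.urlopen(req, timeout=10) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except OSError:
            continue  # dead link, timeout, etc.
        fetched += 1
        # Store the raw HTML; here we just report it.
        print(f"fetched {url} ({len(html)} bytes)")
        # Step 3a: pull out the links and enqueue ones we haven't seen.
        extractor = LinkExtractor()
        extractor.feed(html)
        for link in extractor.links:
            absolute = urljoin(url, link)
            if urlparse(absolute).scheme in ("http", "https") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)

if __name__ == "__main__":
    crawl(START_URL)
```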
As Steve said, this is very JavaScript heavy, which will introduce a few other challenges for you. On the upside, JSON is far easier to parse than rendered HTML and translates directly into objects you can work with quickly and efficiently.
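For example, if you can find the JSON endpoint the forum’s JavaScript calls (the network tab in your browser’s dev tools will show it), you can hit it directly and skip HTML parsing entirely. The endpoint URL and field names below are made up for illustration; the real ones depend on the forum software:

```python
# Hitting a (hypothetical) JSON endpoint directly instead of scraping HTML.
import json
import urllib.request

ENDPOINT = "https://forum.example.com/api/topics?page=1"  # hypothetical URL

req = urllib.request.Request(ENDPOINT, headers={"User-Agent": "my-toy-crawler"})
with urllib.request.urlopen(req, timeout=10) as resp:
    data = json.load(resp)  # JSON maps straight onto dicts and lists

# Assumed response shape: a list of topic objects with "title" and "url".
for topic in data.get("topics", []):
    print(topic["title"], "->", topic["url"])
```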
There aren’t actual ‘pages’ with the forum content to scrape; it loads as you go. Now, you can do all of this with a headless browser, but that’d be a total PITA.
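If you do decide to bite the bullet, something like Playwright keeps the pain manageable. A minimal sketch, assuming you’ve run pip install playwright and playwright install chromium; the thread URL is a placeholder:

```python
# Let a headless browser execute the forum's JavaScript, then grab the
# fully rendered HTML instead of the empty shell a plain HTTP fetch returns.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://forum.example.com/some-thread")  # hypothetical URL
    page.wait_for_load_state("networkidle")  # let the JS-driven loading settle
    html = page.content()
    browser.close()

print(len(html), "bytes of rendered HTML")
```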