https://theinfo.org/

This is a very interesting website. It was started by Aaron Swartz (author of the Guerilla Open Access Manifesto) and, although it appears to have been dormant since 2008, it still has a fair amount of interesting material on it.

The opening line says:

This is a site for large data sets and the people who love them: the scrapers and crawlers who collect them, the academics and geeks who process them, the designers and artists who visualize them. It’s a place where they can exchange tips and tricks, develop and share tools together, and begin to integrate their particular projects.

It has tools and tips for scraping, processing, and visualizing large amounts of data. You should learn to do all of those things.

Unfortunately, a lot of the links don’t work these days. You may be able to find archived copies of them on archive.org.
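
If you want to check for archived copies programmatically, here is a minimal sketch in Python that queries the Internet Archive's Wayback Machine availability API. The function names (build_query, closest_snapshot, find_archived) and the default timestamp of 20080101 are my own choices for illustration, picked because the site seems to have gone quiet around 2008; treat this as a starting point rather than a polished tool.

```python
import json
import urllib.parse
import urllib.request

# Endpoint of the Wayback Machine availability API.
WAYBACK_API = "https://archive.org/wayback/available"


def build_query(url, timestamp="20080101"):
    """Build the API query URL for `url`, asking for the snapshot
    closest to `timestamp` (YYYYMMDD; an assumed default)."""
    params = urllib.parse.urlencode({"url": url, "timestamp": timestamp})
    return f"{WAYBACK_API}?{params}"


def closest_snapshot(response):
    """Return the archived URL from a decoded API response,
    or None if no snapshot is reported as available."""
    closest = response.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def find_archived(url, timestamp="20080101"):
    """Query the API over the network and return the closest snapshot URL."""
    with urllib.request.urlopen(build_query(url, timestamp), timeout=10) as resp:
        return closest_snapshot(json.load(resp))


if __name__ == "__main__":
    # Prints an archived URL if one exists, otherwise None.
    print(find_archived("theinfo.org"))
```

The same find_archived call should work for any of the dead links on the site: pass in the broken URL and open whatever snapshot comes back.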