No comparable class is taught anywhere in the US by someone with Ryan's qualifications. Don't miss this opportunity.
The Internet is one giant API -- with a really terrible interface. Learn how to build web scrapers, crawlers, and bots that traverse websites, or even the Internet at large, collecting, parsing, and storing entire websites' worth of content, statistics, and media in just minutes (and a few dozen lines of code). We’ll work with Python 3.5 and open source libraries including BeautifulSoup, Selenium, and Requests.
A brief outline of the topics covered:
• Overview of libraries, scraping environments, best practices
• Discussion of the best legal and ethical practices around web scraping and crawling, dispelling some common misconceptions and looking at several important cases.
• Scraping a single page of data and parsing HTML
• Exception handling and building crawlers that can handle the messy web without dying
• Traversing a list of pages, handling URL generation and pagination
• Writing crawlers with good data models and software architectures, for maximum convenience and minimum code maintenance. Write once, crawl anywhere!
• Types of crawler movement patterns: recursion, random walks, moving through internal and external links. When to use them, and how to modify them for your own needs.
• Handling forms, logins, and other common crawler challenges
We’ll also have time for Q&A, which may include, time permitting, additional advanced material and a look at other tools and libraries for overcoming common crawler challenges.
Requirements and Recommendations
To get the most out of this workshop, I recommend that you have at least a basic familiarity with Python, have a Python 3.5 environment up and running on your machine, and know how to run scripts and install new Python libraries from the command line.
If you would like to write your crawler output to MySQL, or another database, please have the server installed and running on your machine. You will have the option of using a database, or outputting to CSV (or another format) for several of the exercises.
About the Instructor
Ryan Mitchell (LinkedIn) is a senior software engineer at HedgeServ in Boston, where she creates data analysis tools for hedge fund managers. She has a master’s degree in software engineering from Harvard University Extension School and a bachelor’s in engineering from Olin College of Engineering. Prior to joining HedgeServ, she was a software engineer building web scrapers and bots at Abine Inc. She also regularly does freelance work building web scrapers for clients, primarily in the financial and retail industries.
Ryan is the author of two books about web scraping: Web Scraping with Python (O’Reilly, 2015) and Instant Web Scraping with Java (Packt, 2013).