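1. Getting started
I started by copying the approach I had used to scrape the Douban charts, parsing the page with the BeautifulSoup library, but none of the chart data came back. Inspecting the page source showed that the music chart is nested inside an iframe.
Reading it requires the selenium library.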
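As a rough illustration of why the static parse came back empty, the sketch below parses a made-up outer page using only Python's standard library. The HTML string and the `TagCounter` helper are assumptions for this example, not the real page; only the `g_iframe` name comes from the main code later in this post.

```python
from html.parser import HTMLParser

# Hypothetical outer page: it contains no chart rows, only an <iframe>
# whose document the browser loads separately.
OUTER_HTML = """
<html><body>
  <div id="header">Music</div>
  <iframe name="g_iframe" id="g_iframe" src="about:blank"></iframe>
</body></html>
"""


class TagCounter(HTMLParser):
    """Count start tags and remember the name of every iframe seen."""

    def __init__(self):
        super().__init__()
        self.counts = {}
        self.iframe_names = []

    def handle_starttag(self, tag, attrs):
        self.counts[tag] = self.counts.get(tag, 0) + 1
        if tag == "iframe":
            self.iframe_names.append(dict(attrs).get("name"))


def summarize(html):
    """Return (number of <tr> tags, list of iframe names) for the given HTML."""
    parser = TagCounter()
    parser.feed(html)
    return parser.counts.get("tr", 0), parser.iframe_names


rows, frames = summarize(OUTER_HTML)
print(rows, frames)
```

A parser that only sees the outer document finds the `<iframe>` tag but none of the rows inside it, which is consistent with the empty result described above.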
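2. selenium
selenium official website
Download the driver package that matches your browser version and put it in the same folder as python.exe; mine went into the Anaconda3 folder.
PS: in Google Chrome, enter chrome://version/ in the address bar to see the version.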
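A quick way to sanity-check the pairing is to compare major version numbers. The sketch below assumes, as is generally the case for ChromeDriver, that the driver's major version should equal the browser's. The helper names and the version strings are made up for the example.

```python
def major_version(version: str) -> int:
    """Return the leading number of a dotted version string."""
    return int(version.strip().split(".")[0])


def driver_matches_browser(browser_version: str, driver_version: str) -> bool:
    """True when browser and driver share the same major version."""
    return major_version(browser_version) == major_version(driver_version)


# Hypothetical version strings, e.g. copied from chrome://version/
# and from the name of the downloaded driver package.
print(driver_matches_browser("96.0.4664.110", "96.0.4664.45"))
print(driver_matches_browser("96.0.4664.110", "95.0.4638.69"))
```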
3. Analyzing the page's HTML structure
Work out which HTML tags and attributes hold each piece of the chart content.
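To make the tag and attribute mapping concrete, here is a small sketch that applies the same relative XPath expressions used in the main code to a hand-written table, using the standard library's `xml.etree.ElementTree`. The markup, the song data, and the `parse_rows` helper are invented for illustration; the real page is larger and may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified chart rows. Note the trailing space in "u-dur ".
SAMPLE_ROWS = """
<table>
  <tbody>
    <tr>
      <td><span class="num">1</span></td>
      <td><b title="Song A">Song A</b></td>
      <td><span class="u-dur ">03:45</span></td>
      <td><div class="text" title="Artist A">Artist A</div></td>
    </tr>
    <tr>
      <td><span class="num">2</span></td>
      <td><b title="Song B">Song B</b></td>
      <td><span class="u-dur ">04:10</span></td>
      <td><div class="text" title="Artist B">Artist B</div></td>
    </tr>
  </tbody>
</table>
"""


def parse_rows(markup):
    """Extract rank, artist, title and duration from each table row."""
    root = ET.fromstring(markup)
    rows = []
    for tr in root.findall(".//tbody/tr"):
        rows.append({
            "num": tr.find(".//span[@class='num']").text,
            "songer": tr.find(".//div[@class='text']").get("title"),
            "song": tr.find(".//b").get("title"),
            "song_time": tr.find(".//span[@class='u-dur ']").text,
        })
    return rows


rows = parse_rows(SAMPLE_ROWS)
for row in rows:
    print(row)
```

An `@class='...'` test compares the whole attribute string, so the trailing space in `'u-dur '` matters: without it the expression would not match the sample rows.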
4. Main code
Save the data to an Excel spreadsheet.
```python
from openpyxl import Workbook
from selenium import webdriver
from selenium.webdriver.common.by import By

# DOWNLOAD_URL (the chart page) and dest_filename (the output .xlsx path)
# are defined elsewhere in the full script.

# Workbook setup is not shown in the original excerpt; a plain openpyxl
# workbook is assumed here.
wb = Workbook()
ws1 = wb.active

# Not used by the selenium calls below.
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.80 Safari/537.36'
}

# https://selenium-python.readthedocs.io/installation.html
driver = webdriver.Chrome()
driver.get(DOWNLOAD_URL)
driver.switch_to.frame("g_iframe")  # the chart rows live inside this iframe
li_list = driver.find_elements(By.XPATH, "//tbody/tr")

content_list = []
for li in li_list:
    item = {}
    # Only the top three songs have a cover image.
    # item["cover"] = li.find_element(By.XPATH, ".//img[@class='rpic']").get_attribute("src")
    item["num"] = li.find_element(By.XPATH, ".//span[@class='num']").text
    item["songer"] = li.find_element(By.XPATH, ".//div[@class='text']").get_attribute("title")
    item["song"] = li.find_element(By.XPATH, ".//b").get_attribute("title")
    item["song_time"] = li.find_element(By.XPATH, ".//span[@class='u-dur ']").text
    print(item)
    content_list.append(item)

# Header row
ws1['A1'] = "排名"  # rank
ws1['B1'] = "歌手"  # artist
ws1['C1'] = "歌名"  # song title
ws1['D1'] = "歌曲时间"  # duration

# Data starts on row 2, directly below the header row.
for index, it in enumerate(content_list):
    col_a = 'A%s' % (index + 2)
    col_b = 'B%s' % (index + 2)
    col_c = 'C%s' % (index + 2)
    col_d = 'D%s' % (index + 2)
    ws1[col_a] = str(it['num'])
    ws1[col_b] = str(it['songer'])
    ws1[col_c] = str(it['song'])
    ws1[col_d] = str(it['song_time'])

ws1.column_dimensions['B'].width = 60.0  # adjust column widths
ws1.column_dimensions['C'].width = 90.0
wb.save(filename=dest_filename)
driver.quit()
```
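The cell addressing in the write loop can also be factored into a small helper. This is only a sketch: `rows_to_cells` and the sample rows are hypothetical names and data, not part of the original script. The column letters and the row offset of 2 mirror the loop above.

```python
def rows_to_cells(content_list, first_row=2):
    """Map scraped rows to spreadsheet cell addresses such as 'A2' or 'D3'.

    first_row defaults to 2 because row 1 holds the column headers.
    """
    columns = (("A", "num"), ("B", "songer"), ("C", "song"), ("D", "song_time"))
    cells = {}
    for index, item in enumerate(content_list):
        for letter, key in columns:
            cells["%s%d" % (letter, index + first_row)] = str(item[key])
    return cells


# Made-up rows in the same shape as the scraped items.
sample = [
    {"num": "1", "songer": "Artist A", "song": "Song A", "song_time": "03:45"},
    {"num": "2", "songer": "Artist B", "song": "Song B", "song_time": "04:10"},
]
cells = rows_to_cells(sample)
print(cells["A2"], cells["C3"])
```

Each address and value pair can then be written with `ws1[address] = value`, which keeps the offset arithmetic in one place.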
Code link