
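1. Getting started
At first I copied the approach I had used for scraping the Douban charts and tried the BeautifulSoup library, but it could not retrieve the chart. Inspecting the page source showed that the music chart is nested inside an iframe,
so the selenium library is needed.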
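To illustrate why parsing the outer page finds nothing, here is a minimal sketch. It uses the standard library's `html.parser` rather than BeautifulSoup so that it runs without extra packages, and the HTML string is a made-up stand-in for the real page (only the frame id `g_iframe` comes from the code later in this post).

```python
from html.parser import HTMLParser

# Hypothetical page source: the outer document only contains the <iframe> tag.
# The chart rows live in a separate document that the browser loads into the frame.
OUTER_HTML = """
<html><body>
  <h1>Charts</h1>
  <iframe id="g_iframe" name="g_iframe" src="about:blank"></iframe>
</body></html>
"""


class TagCounter(HTMLParser):
    """Counts how many times each tag name is opened in the parsed document."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        self.counts[tag] = self.counts.get(tag, 0) + 1


counter = TagCounter()
counter.feed(OUTER_HTML)

print("iframe tags:", counter.counts.get("iframe", 0))  # the frame element is visible
print("tr tags:", counter.counts.get("tr", 0))          # the table rows are not
```

A static parser only sees the document it was given, which is why a real browser driven by selenium is used to switch into the frame.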
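2. selenium
selenium official website
Download the driver package that matches your browser version and put it in the same folder as python.exe; mine is in the Anaconda3 folder.
PS: in Google Chrome, enter chrome://version/ in the address bar to see the version.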
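As a rough illustration of "matching" versions, the sketch below compares the major version (the number before the first dot) of the browser and the driver, which is the usual rule of thumb. The helper names and the version strings are hypothetical examples, not values from this post.

```python
def major_version(version: str) -> int:
    """Return the number before the first dot, e.g. 120 for '120.0.6099.109'."""
    return int(version.strip().split(".")[0])


def versions_match(browser_version: str, driver_version: str) -> bool:
    """Rule-of-thumb check: browser and driver share the same major version."""
    return major_version(browser_version) == major_version(driver_version)


# Hypothetical version strings, as they might be copied from chrome://version/
# and from the driver download page.
browser = "120.0.6099.109"
driver = "120.0.6099.71"

print(versions_match(browser, driver))
print(versions_match(browser, "119.0.6045.105"))
```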
3. Analyzing the page structure
Work out which HTML tags and attributes correspond to each piece of the chart content.
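The sketch below shows what such an analysis boils down to: one XPath expression per field. The markup is a made-up, simplified row modelled on the selectors used in the main code (`span.num`, `div.text`, `b`, `span.u-dur `), not a copy of the real page, and it uses the standard library's `xml.etree.ElementTree` instead of a browser so it can be run offline.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified chart row; the real page has more nesting and attributes.
SAMPLE = """
<table>
  <tbody>
    <tr>
      <td><span class="num">1</span></td>
      <td><b title="Example Song">Example Song</b></td>
      <td><span class="u-dur ">03:45</span></td>
      <td><div class="text" title="Example Artist">Example Artist</div></td>
    </tr>
  </tbody>
</table>
"""

root = ET.fromstring(SAMPLE.strip())
rows = root.findall(".//tbody/tr")  # one <tr> per song
first = rows[0]

# Each field maps to one tag plus one attribute (or the tag's text).
item = {
    "num": first.find(".//span[@class='num']").text,
    "songer": first.find(".//div[@class='text']").get("title"),
    "song": first.find(".//b").get("title"),
    "song_time": first.find(".//span[@class='u-dur ']").text,
}
print(item)
```

Note that the class value `u-dur ` is matched exactly, including the trailing space, which is why the selector in the main code keeps that space.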
4. Main code
Save the data to an Excel spreadsheet
```python
from openpyxl import Workbook
from selenium import webdriver
from selenium.webdriver.common.by import By

# DOWNLOAD_URL (the chart page URL) and dest_filename (the output .xlsx path)
# are defined elsewhere in the full script and are not shown in this excerpt.

# Assumed setup: the excerpt uses wb and ws1 without showing how they are created.
wb = Workbook()
ws1 = wb.active

# Not used by the webdriver below.
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.80 Safari/537.36'
}

driver = webdriver.Chrome()
# https://selenium-python.readthedocs.io/installation.html
driver.get(DOWNLOAD_URL)
# The chart lives inside an iframe, so switch into it before querying.
driver.switch_to.frame("g_iframe")

# find_elements(By.XPATH, ...) replaces find_elements_by_xpath, which was removed in Selenium 4.
li_list = driver.find_elements(By.XPATH, "//tbody/tr")
content_list = []
for li in li_list:
    item = {}
    # Only the top three songs have a cover image.
    # item["cover"] = li.find_element(By.XPATH, ".//img[@class='rpic']").get_attribute("src")
    item["num"] = li.find_element(By.XPATH, ".//span[@class='num']").text
    item["songer"] = li.find_element(By.XPATH, ".//div[@class='text']").get_attribute("title")
    item["song"] = li.find_element(By.XPATH, ".//b").get_attribute("title")
    item["song_time"] = li.find_element(By.XPATH, ".//span[@class='u-dur ']").text
    print(item)
    content_list.append(item)

# Header row
ws1['A1'] = "Rank"
ws1['B1'] = "Artist"
ws1['C1'] = "Song"
ws1['D1'] = "Duration"

# Data rows start at row 2, directly below the header.
for index, it in enumerate(content_list):
    col_a = 'A%s' % (index + 2)
    col_b = 'B%s' % (index + 2)
    col_c = 'C%s' % (index + 2)
    col_d = 'D%s' % (index + 2)
    ws1[col_a] = str(it['num'])
    ws1[col_b] = str(it['songer'])
    ws1[col_c] = str(it['song'])
    ws1[col_d] = str(it['song_time'])

ws1.column_dimensions['B'].width = 60.0  # adjust column widths
ws1.column_dimensions['C'].width = 90.0
wb.save(filename=dest_filename)
driver.quit()
```
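The row-to-cell mapping in the second loop can be looked at in isolation. The following is a minimal sketch with no browser or Excel dependency: `rows_to_cells` is a hypothetical helper, and the sample songs are invented data that only reuse the dictionary keys from the code above.

```python
def rows_to_cells(items, columns, first_row=2):
    """Map a list of dicts to {cell address: value}, one spreadsheet row per dict."""
    cells = {}
    for offset, item in enumerate(items):
        row = first_row + offset  # row 1 is reserved for the header
        for letter, key in columns.items():
            cells[f"{letter}{row}"] = str(item[key])
    return cells


# Column letter -> dictionary key, in the same layout as the spreadsheet above.
columns = {"A": "num", "B": "songer", "C": "song", "D": "song_time"}

# Invented sample data for illustration only.
sample = [
    {"num": "1", "songer": "Example Artist", "song": "Example Song", "song_time": "03:45"},
    {"num": "2", "songer": "Another Artist", "song": "Another Song", "song_time": "04:10"},
]

cells = rows_to_cells(sample, columns)
for address, value in cells.items():
    print(address, value)
```

With an openpyxl worksheet at hand, the resulting dictionary could be written out with `ws1[address] = value` for each pair.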
Code link