Datehoer's Blog
"I offered my heart to the bright moon, but alas, the moon shines on the ditch."

Scraping 传智播客 Homework Questions with Python

Author: Datehoer · Published: 2020-12-21 23:02:55 · Source: original
Today I'm sharing a crawler I wrote a while back for scraping 传智播客 exercise questions.

With finals approaching, I wanted to scrape the questions out of the question bank and compile them into a study set, mainly because the teacher never published the answers!

What I didn't expect was that 传智播客 renders its pages with JavaScript instead of serving plain HTML, which makes things considerably harder for a beginner.

At first I could see the content I wanted but had no idea how to actually scrape it.
[Screenshot: the 传智播客 page content to be scraped]

My first step was to right-click and view the page source. I assumed I could pull the links straight out of the HTML and scrape from there, but nothing was in it: the page is rendered by JavaScript, so the data has to be fetched from the underlying API request instead, with the right request headers.

Next I opened DevTools with F12, switched to the Network tab, and refreshed the page. Digging through the requests, I found one named list?status=&name=&pageNumber=1&pageSize=10&courseId=5b1d1d6e854e408b89043bb29c604313&type=1&t=1608267215390 whose Preview pane contained exactly the data I wanted:

{"code":null,"errorMessage":null,"resultObject":{"items":[{"paper_tpl_name":"第七章DOM主观题","start_time":"2020-12-08 20:09:00","end_time":"2020-12-31 20:09:00","status":2,"busyworkId":"655d240321db424ea37f4b55c2f132d2","studentId":"e7b9174113e84748a996adeca8e67cef","score":null,"startEndStatusText":"进行中"},{"paper_tpl_name":"第七章DOM 客观题","start_time":"2020-12-08 20:08:00","end_time":"2020-12-31 20:08:00","status":3,"busyworkId":"2b5062135cf84e8f9dfe2a77e06d2123","studentId":"e7b9174113e84748a996adeca8e67cef","score":null,"startEndStatusText":"进行中"},{"paper_tpl_name":"第四章 函数 客观题","start_time":"2020-11-29 10:43:00","end_time":"2020-12-31 10:43:00","status":3,"busyworkId":"da8bd6ad0a63474e9fff0b5310d942b1","studentId":"e7b9174113e84748a996adeca8e67cef","score":null,"startEndStatusText":"进行中"},{"paper_tpl_name":"第四章 主观题","start_time":"2020-11-29 10:41:00","end_time":"2020-12-31 10:41:00","status":2,"busyworkId":"408aacdac7da41038fbc13ce327d43ce","studentId":"e7b9174113e84748a996adeca8e67cef","score":null,"startEndStatusText":"进行中"},{"paper_tpl_name":"第三章数组主观题","start_time":"2020-11-02 14:36:00","end_time":"2021-01-03 14:36:00","status":2,"busyworkId":"247a6f8b309b4e9c8fe7880b5b30198f","studentId":"e7b9174113e84748a996adeca8e67cef","score":null,"startEndStatusText":"进行中"},{"paper_tpl_name":"第三章数组客观题","start_time":"2020-11-02 14:35:00","end_time":"2021-01-02 14:35:00","status":3,"busyworkId":"663eee4dbe5e4705923d89519461610a","studentId":"e7b9174113e84748a996adeca8e67cef","score":null,"startEndStatusText":"进行中"},{"paper_tpl_name":"第二章js第二章主观题","start_time":"2020-10-12 15:11:00","end_time":"2021-01-02 15:11:00","status":2,"busyworkId":"f47eea78f49c4682bb80b24f99fa6ea9","studentId":"e7b9174113e84748a996adeca8e67cef","score":null,"startEndStatusText":"进行中"},{"paper_tpl_name":"第二章js语法客观题","start_time":"2020-10-12 15:03:00","end_time":"2021-01-02 15:03:00","status":3,"busyworkId":"43aaa769160e4396b5be93f422e793c4","studentId":"e7b9174113e84748a996adeca8e67cef","score":null,"startEndStatusText":"进行中"},{"paper_tpl_name":"第一章习题","start_time":"2020-09-13 20:59:00","end_time":"2020-12-31 20:59:00","status":2,"busyworkId":"166955876c3b4bcf8ef39976dfe60e76","studentId":"e7b9174113e84748a996adeca8e67cef","score":null,"startEndStatusText":"进行中"}],"totalCount":9,"totalPageCount":1,"pageSize":10,"currentPage":1},"success":true}


Each item in that response carries a busyworkId, and it turns out each homework's page URL is just a fixed prefix with that id appended.
So the plan is: send the request with the right headers, parse the JSON that comes back, and extract what we need.
After that, we fetch the questions from each homework page the same way.
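As a quick sketch of that step, here is how the busyworkId values can be pulled out and turned into page URLs. The two items below are a hypothetical excerpt in the same shape as the list response shown above:

```python
import json

# Hypothetical two-item excerpt, same shape as the real list response.
list_json = '''{"resultObject": {"items": [
    {"paper_tpl_name": "第一章习题", "busyworkId": "166955876c3b4bcf8ef39976dfe60e76"},
    {"paper_tpl_name": "第七章DOM主观题", "busyworkId": "655d240321db424ea37f4b55c2f132d2"}
]}}'''

items = json.loads(list_json)["resultObject"]["items"]
# Each homework page is a fixed prefix plus its busyworkId.
pages = ["http://stu.ityxb.com/lookPaper/busywork/" + item["busyworkId"]
         for item in items]
print(pages[0])  # → http://stu.ityxb.com/lookPaper/busywork/166955876c3b4bcf8ef39976dfe60e76
```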


This part took me a long time to figure out. I didn't realize the response was JSON and treated it as a plain string; when I tried to pull the data out with a regex, I only ever got the first match. A more experienced friend finally pointed out that it was JSON.

The data comes back as JSON, which parses into nested dictionaries (and lists).
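A minimal illustration of that nesting. The response body is a made-up sample, but the danxuan / lists / questionOptionList keys match the real API shown later:

```python
import json

# Made-up response with the same nesting as the real one.
raw = ('{"resultObject": {"danxuan": {"lists": ['
       '{"questionContentText": "1 + 1 = ?",'
       ' "questionOptionList": [{"text": "A. 2"}, {"text": "B. 3"}]}]}}}')
data = json.loads(raw)  # one call turns the string into nested dicts/lists
question = data["resultObject"]["danxuan"]["lists"][0]
print(question["questionContentText"])            # → 1 + 1 = ?
print(question["questionOptionList"][0]["text"])  # → A. 2
```

Once parsed, every level is reached by ordinary key and index access, with no regex needed.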

import requests_html  # third-party: pip install requests-html
import re
import json

url = "http://stu.ityxb.com/back/bxg/my/busywork/findStudentBusywork?busyworkId=166955876c3b4bcf8ef39976dfe60e76&t=1608445306731"
url2 = 'http://stu.ityxb.com/lookPaper/busywork/166955876c3b4bcf8ef39976dfe60e76'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36',
    'Cookie': ''  # fill in your logged-in cookie
}
res = requests_html.HTMLSession()
r = res.get(url, headers=headers)
# with open("js.txt", 'r', encoding='utf-8') as j:
#     neirong = j.read()
text = json.loads(r.text)
# "danxuan" = the single-choice section of the response
text_xz = text['resultObject']["danxuan"]['lists']
neirong = {}
# print(len(text_xz))
changdu = len(text_xz)
for i in range(1, changdu + 1):
    wenzi = text_xz.pop(0)
    text_xx = wenzi["questionOptionList"]
    # id = changdu - i + 1
    text_tm = str(i) + '.' + wenzi["questionContentText"]
    print(text_tm)
    with open('jsdx.text', 'a', encoding='utf-8') as f:
        f.write('\n' + text_tm)
    for num in range(len(text_xx)):
        xx_dan = text_xx.pop(0)
        xx_nei = xx_dan['text']
        with open('jsdx.text', 'a', encoding='utf-8') as f:
            f.write('\n' + xx_nei)


That extracts the single-choice questions. The next step is to grab all the other sections too, but some homeworks contain only single-choice or only true/false questions, so indexing a missing section would raise an error. Wrapping each section in a try/except lets the script keep running even when a section is absent.
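A minimal sketch of that pattern. The section keys match the real response; the fake result dict, which only contains a single-choice section, is illustrative:

```python
# Illustrative fake result: only the single-choice ("danxuan") section exists.
result = {"resultObject": {"danxuan": {"lists": [
    {"questionContentText": "demo question"}]}}}

found = []
for section in ("danxuan", "duoxuan", "panduan", "tiankong", "jianda"):
    try:
        questions = result["resultObject"][section]["lists"]
    except KeyError:
        continue  # this homework has no such section; keep going
    for q in questions:
        found.append((section, q["questionContentText"]))

print(found)  # → [('danxuan', 'demo question')]
```

Catching KeyError specifically (rather than a bare except) keeps unrelated bugs visible while still skipping absent sections.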

import requests_html  # third-party: pip install requests-html
import re
import json

url = "http://stu.ityxb.com/back/bxg/my/busywork/findStudentBusywork?busyworkId=166955876c3b4bcf8ef39976dfe60e76&t=1608445306731"
url2 = 'http://stu.ityxb.com/back/bxg/my/busywork/findStudentBusywork?busyworkId=655d240321db424ea37f4b55c2f132d2&t=1608476055939'
headers = {
    'User-Agent': '',  # fill in your browser's User-Agent
    'Cookie': ''       # fill in your logged-in cookie
}
res = requests_html.HTMLSession()
r = res.get(url2, headers=headers)
text = json.loads(r.text)

# "danxuan" = single-choice questions
text_xz = text['resultObject']["danxuan"]['lists']
neirong = {}
for i in range(1, len(text_xz) + 1):
    wenzi = text_xz.pop(0)
    text_xx = wenzi["questionOptionList"]
    text_tm = str(i) + '.' + wenzi["questionContentText"]
    print(text_tm)
    with open('jsdx.text', 'a', encoding='utf-8') as f:
        f.write('\n' + text_tm)
    for num in range(len(text_xx)):
        xx_dan = text_xx.pop(0)
        xx_nei = xx_dan['text']
        with open('jsdx.text', 'a', encoding='utf-8') as f:
            f.write('\n' + xx_nei)
    with open('jsdx.text', 'a', encoding='utf-8') as f:
        f.write('\n' + '答案:')

# "duoxuan" = multiple-choice; the section may be absent, hence try/except
try:
    text_xz = text['resultObject']["duoxuan"]['lists']
    for i in range(1, len(text_xz) + 1):
        wenzi = text_xz.pop(0)
        text_xx = wenzi["questionOptionList"]
        text_tm = str(i) + '.' + wenzi["questionContentText"]
        print(text_tm)
        with open('jsdx.text', 'a', encoding='utf-8') as f:
            f.write('\n' + text_tm)
        for num in range(len(text_xx)):
            xx_dan = text_xx.pop(0)
            xx_nei = xx_dan['text']
            with open('jsdx.text', 'a', encoding='utf-8') as f:
                f.write('\n' + xx_nei)
        with open('jsdx.text', 'a', encoding='utf-8') as f:
            f.write('\n' + '答案:')
except:
    pass

# "panduan" = true/false questions (no option list)
try:
    text_pd = text['resultObject']["panduan"]['lists']
    for i in range(1, len(text_pd) + 1):
        wenzi = text_pd.pop(0)
        text_tm = str(i) + '.' + wenzi["questionContentText"]
        print(text_tm)
        with open('jsdx.text', 'a', encoding='utf-8') as f:
            f.write('\n' + text_tm + '\n' + '答案:')
except:
    pass

# "tiankong" = fill-in-the-blank questions
try:
    text_tk = text['resultObject']["tiankong"]['lists']
    for i in range(1, len(text_tk) + 1):
        wenzi = text_tk.pop(0)
        text_tm = str(i) + '.' + wenzi["questionContentText"]
        print(text_tm)
        with open('jsdx.text', 'a', encoding='utf-8') as f:
            f.write('\n' + text_tm + '\n' + '答案:')
except:
    pass

# "jianda" = short-answer questions
try:
    text_jd = text['resultObject']["jianda"]['lists']
    for i in range(1, len(text_jd) + 1):
        wenzi = text_jd.pop(0)
        text_tm = str(i) + '.' + wenzi["questionContentText"]
        print(text_tm)
        with open('jsdx.text', 'a', encoding='utf-8') as f:
            f.write('\n' + text_tm + '\n' + '答案:')
except:
    pass



That's the complete code for scraping 传智播客 homework questions with Python. The remaining annoyance is that you have to copy the detail URL by hand for each homework; if I improve it later I'll post the update.
The fix is actually straightforward: parse the list page's JSON first, extract the busyworkId from each item, and feed the resulting URLs into the for loop. Feel free to try that yourself.
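A hedged sketch of that improvement. The helper name detail_urls and the trimmed query string are my own; the endpoint paths are the ones captured in DevTools earlier, and a valid Cookie is still required for the live request:

```python
import json

# The list endpoint captured in DevTools earlier; the t (timestamp) parameter is omitted.
LIST_URL = ("http://stu.ityxb.com/back/bxg/my/busywork/list"
            "?status=&name=&pageNumber=1&pageSize=10"
            "&courseId=5b1d1d6e854e408b89043bb29c604313&type=1")
DETAIL_URL = "http://stu.ityxb.com/back/bxg/my/busywork/findStudentBusywork?busyworkId={}"

def detail_urls(list_response_text):
    # Parse the list JSON and build one detail URL per homework.
    items = json.loads(list_response_text)["resultObject"]["items"]
    return [DETAIL_URL.format(item["busyworkId"]) for item in items]

def fetch_all(headers):
    # Live version: fetch the list, then print every detail URL
    # (requires a filled-in Cookie header; not run here).
    import requests_html  # third-party: pip install requests-html
    session = requests_html.HTMLSession()
    r = session.get(LIST_URL, headers=headers)
    for url in detail_urls(r.text):
        print(url)  # feed each URL into the question-scraping loop above
```

Each URL returned by detail_urls can then replace the hand-copied url2 in the script above, turning the whole thing into a single automated pass.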
If you get stuck on anything, leave a comment below and I'll reply as soon as I see it.





Copyright: This article is original work by Datehoer; all rights reserved by the author. For commercial reprinting, contact the author for authorization; for non-commercial reprinting, keep the author information above and the original link: https://zjzdmc.top/jsfx/96.html.
