Getting text from inside HTML tag without knowing all attributes

- Content updated: 2015-12-20
Question:

I'm trying to crawl all the repository names found in the docker hub via this link: https://hub.docker.com/search/?q=*&page=1&isAutomated=0&isOfficial=1&pullCount=0&starCount=0

The HTML tag I'm interested in is:

<div class="RepositoryListItem__repoName___3iIWs" data-reactid=".s0zyncta0w.1.2.1.0.0.$4lexnz/overtime.0.0.1.0">4lexnz/overtime</div>

where the data-reactid is always different for each repository.

I'm using Bash and would like to grep the text between the div tag for each div that contains class="RepositoryListItem__repoName___3iIWs". Can someone please help me construct a regexp and command chain to do that in bash?

So far I have:

content=$(curl -L 'https://hub.docker.com/search/?q=*&page=1&isAutomated=0&isOfficial=0&pullCount=0&starCount=0')
echo $content | grep -oP '(?<=<div class="RepositoryListItem__repoName___3iIWs").*?(?= </div>)'

but this doesn't return anything at all. The value of $content is correct, so it's the last grep that's not doing what I want. Can someone help please? Thank you!

Answer:

I think you should use something like:

content=$(curl -L 'https://hub.docker.com/search/?q=*&page=1&isAutomated=0&isOfficial=0&pullCount=0&starCount=0')
echo $content | grep -oP '<div class="RepositoryListItem__repoName___3iIWs"\s(.)+?>(\K.+?)(?=<\/div>)'

This seems to work for me; it extracts exactly the text between the opening <div ...> tag and the closing </div>.

Please note that I'm quite new to using regular expressions with grep, so there may be a cleverer approach, but this does what you are looking for. The \K escape discards everything matched before it from the reported match, and the (?=...) lookahead keeps the </div> part out of the match.
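To see \K and the lookahead in isolation, here is a minimal sketch that runs against a hard-coded sample instead of the live page. It assumes GNU grep built with PCRE support (the -P option). The second div, the html variable, and the extract_repo_names helper are hypothetical and only there for illustration. It also uses [^>]* to skip the unknown attributes and [^<]+ for the text; that is a variation on the pattern above, not the original answer's regex.

```shell
#!/usr/bin/env bash
# Sample markup standing in for the page body.
# The second <div> is a made-up entry added for illustration.
html='<div class="RepositoryListItem__repoName___3iIWs" data-reactid=".s0zyncta0w.1.2.1.0.0.$4lexnz/overtime.0.0.1.0">4lexnz/overtime</div><div class="RepositoryListItem__repoName___3iIWs" data-reactid=".abc.1">example/repo</div>'

# Hypothetical helper: reads HTML on stdin, prints one repository name per line.
#   [^>]*       skips any remaining attributes (e.g. data-reactid) up to '>'
#   \K          drops everything matched so far from the reported match
#   [^<]+       the text inside the div
#   (?=</div>)  requires the closing tag without including it in the match
extract_repo_names() {
  grep -oP '<div class="RepositoryListItem__repoName___3iIWs"[^>]*>\K[^<]+(?=</div>)'
}

# Quote the variable so the shell passes the markup through unchanged.
names=$(printf '%s\n' "$html" | extract_repo_names)
printf '%s\n' "$names"
```

With the sample above this prints 4lexnz/overtime and example/repo on separate lines. For the real page, the output of curl could be piped into the same helper, provided the markup still uses this class name.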

OP: Perfect! That solved it, thank you very much :)

Commenter: Great! I added some extra info in case you want to make it better! ;)