python – How do I make this recursive crawl function iterative?

For both academic and performance reasons, given this recursive web-crawling function (it only crawls within a given domain), what is the best way to make it run iteratively? Currently, by the time it finishes running, Python's memory usage has climbed past 1 GB, which is not acceptable in a shared hosting environment.

  def crawl(self, url):
      "Get all URLs from which to scrape categories."
      try:
          links = BeautifulSoup(urllib2.urlopen(url)).findAll(Crawler._match_tag)
      except urllib2.HTTPError:
          return
      for link in links:
          for attr in link.attrs:
              if Crawler._match_attr(attr):
                  if Crawler._is_category(attr):
                      pass
                  elif attr[1] not in self._crawled:
                      self._crawled.append(attr[1])
                      self.crawl(attr[1])
Best answer
Use BFS instead of recursive crawling (DFS): http://en.wikipedia.org/wiki/Breadth_first_search

You can back the BFS queue with an external storage solution (such as a database) to free up RAM.
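As an illustration of that idea, here is a minimal sketch of a FIFO frontier persisted in SQLite (via the standard-library `sqlite3` module), so pending URLs live on disk instead of in memory. The class name and schema are made up for this example:

```python
import sqlite3

class SqliteQueue:
    """A minimal FIFO queue persisted in SQLite, so the crawl
    frontier does not have to live in RAM. Illustrative sketch only."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS frontier "
            "(id INTEGER PRIMARY KEY AUTOINCREMENT, url TEXT UNIQUE)")

    def push(self, url):
        # INSERT OR IGNORE keeps the frontier free of duplicate URLs.
        self.conn.execute(
            "INSERT OR IGNORE INTO frontier (url) VALUES (?)", (url,))
        self.conn.commit()

    def pop(self):
        # Take the oldest row (lowest id) to get FIFO/BFS order.
        row = self.conn.execute(
            "SELECT id, url FROM frontier ORDER BY id LIMIT 1").fetchone()
        if row is None:
            return None
        self.conn.execute("DELETE FROM frontier WHERE id = ?", (row[0],))
        self.conn.commit()
        return row[1]

    def __len__(self):
        return self.conn.execute(
            "SELECT COUNT(*) FROM frontier").fetchone()[0]
```

Swapping `ORDER BY id` for `ORDER BY id DESC` in `pop` would turn this into a stack, i.e. DFS order instead of BFS.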

算法是:

//pseudocode:
var urlsToVisit = new Queue(); // Could be a queue (BFS) or a stack (DFS), probably with a database backing.
var visitedUrls = new Set(); // Set of visited URLs.

// initialization:
urlsToVisit.Add( rootUrl );

while(urlsToVisit.Count > 0) {
  var nextUrl = urlsToVisit.FetchAndRemoveNextUrl();
  var page = FetchPage(nextUrl);
  ProcessPage(page);
  visitedUrls.Add(nextUrl);
  var links = ParseLinks(page);
  foreach (var link in links)
     if (!visitedUrls.Contains(link))
        urlsToVisit.Add(link); 
}
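In Python, the pseudocode above maps naturally onto `collections.deque`. Below is a hedged sketch of the original `crawl` rewritten iteratively; `get_links(url)` is a hypothetical stand-in for the `urllib2.urlopen` + `BeautifulSoup` fetch-and-parse step, injected as a parameter so the traversal logic is self-contained:

```python
from collections import deque

def crawl_iterative(root_url, get_links):
    """Iterative BFS version of the recursive crawl().

    get_links(url) is a stand-in (hypothetical helper) for fetching a
    page and extracting its link URLs. Returns URLs in visit order.
    """
    to_visit = deque([root_url])  # the frontier; popleft() gives BFS, pop() would give DFS
    crawled = set()               # replaces self._crawled; set membership is O(1), unlike a list
    order = []
    while to_visit:
        url = to_visit.popleft()
        if url in crawled:
            continue  # the same URL may have been enqueued twice before its first visit
        crawled.add(url)
        order.append(url)
        try:
            links = get_links(url)
        except IOError:  # the original caught urllib2.HTTPError here
            continue
        for link in links:
            if link not in crawled:
                to_visit.append(link)
    return order
```

Because the frontier and visited set are plain data structures rather than call-stack frames, memory use is bounded by the number of distinct URLs, and either structure can be moved to disk (e.g. the SQLite-backed queue suggested above) without changing the loop.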
