Xiaohongshu Viral Notes Data Collection and Analysis
Project Overview
This tutorial will guide you through developing a professional Xiaohongshu data collection and analysis tool. The tool has the following core features:
- Automated viral note data collection
- Deep analysis of viral content characteristics
- Data export and visualization
- Automatic analysis report generation
Technical Solution
To achieve efficient and stable data collection and analysis, we use the following technology stack:
Backend Technology:
- Node.js runtime environment
- TypeScript type system
- Puppeteer automation
- Express API service
- MongoDB data storage
Frontend Display:
- React component development
- Chart.js data visualization
Development Process
1. Environment Setup
First, create the project and install dependencies:
mkdir redbook-scraper
cd redbook-scraper
npm init -y
npm install typescript ts-node @types/node puppeteer express mongodb zod node-cron
npm install -D nodemon @types/express @types/node-cron
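npm init -y creates a package.json without run scripts. Since the Usage Guide later invokes npm run dev, you might add something like the following (the script names and commands are assumptions, not from the original):
// package.json (excerpt) — hypothetical scripts to match `npm run dev` below
{
  "scripts": {
    "dev": "nodemon --exec ts-node src/index.ts",
    "build": "tsc",
    "start": "node dist/index.js"
  }
}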
2. Directory Structure Design
Organize code using a modular directory structure:
redbook-scraper/
├── src/
│ ├── config/ # Configuration files
│ ├── models/ # Data models
│ ├── services/ # Core services
│ ├── utils/ # Utility functions
│ └── index.ts # Entry file
├── data/ # Data storage
├── logs/ # Log files
└── tsconfig.json # TS configuration
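The tree lists tsconfig.json, which the tutorial never shows. A minimal configuration that works with ts-node; the compiler options are reasonable defaults, not taken from the original:
// tsconfig.json — a minimal sketch; adjust to taste
{
  "compilerOptions": {
    "target": "ES2020",
    "module": "commonjs",
    "strict": true,
    "esModuleInterop": true,
    "outDir": "dist",
    "rootDir": "src"
  },
  "include": ["src"]
}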
3. System Configuration
// src/config/config.ts
export const config = {
  mongodb: {
    url: process.env.MONGODB_URL || 'mongodb://localhost:27017/redbook',
    dbName: 'redbook'
  },
  scraper: {
    headless: true,
    timeout: 30000,
    maxConcurrency: 5, // referenced later by the batch scraper
    userAgent: 'Mozilla/5.0 ...'
  },
  api: {
    port: process.env.PORT || 3000
  }
}
4. Data Modeling
The data layer defines separate models for notes, users, and aggregate statistics:
// src/models/Note.ts
import { ObjectId } from 'mongodb'
import { z } from 'zod' // Use Zod for runtime type validation

// Note model validation schema
const NoteSchema = z.object({
  _id: z.instanceof(ObjectId).optional(),
  noteId: z.string().min(1),
  title: z.string().min(1).max(100),
  content: z.string().min(1),
  images: z.array(z.string().url()),
  likes: z.number().int().min(0),
  comments: z.number().int().min(0),
  shares: z.number().int().min(0),
  author: z.object({
    id: z.string(),
    name: z.string(),
    followers: z.number().int().min(0)
  }),
  tags: z.array(z.string()),
  createdAt: z.date(),
  scrapedAt: z.date(),
  // Additional fields
  location: z.string().optional(),
  device: z.string().optional(),
  topics: z.array(z.string()).optional(),
  mentions: z.array(z.string()).optional(),
  products: z.array(z.object({
    id: z.string(),
    name: z.string(),
    price: z.number().optional()
  })).optional()
})

// Export type
export type Note = z.infer<typeof NoteSchema>

// Database index design
export const NoteIndexes = [
  { key: { noteId: 1 }, unique: true },
  { key: { createdAt: -1 } },
  { key: { 'author.id': 1 } },
  { key: { tags: 1 } },
  { key: { likes: -1 } }
] as const

// User model
export interface User {
  _id?: ObjectId
  userId: string
  nickname: string
  avatar: string
  followers: number
  following: number
  notes: number
  verified: boolean
  verifiedType?: string
  description?: string
  location?: string
  createdAt: Date
  updatedAt: Date
}

// Statistics model
export interface Stats {
  _id?: ObjectId
  date: Date
  totalNotes: number
  totalUsers: number
  avgLikes: number
  avgComments: number
  avgShares: number
  topTags: Array<{
    tag: string
    count: number
  }>
  topAuthors: Array<{
    userId: string
    noteCount: number
    totalEngagement: number
  }>
  updatedAt: Date
}
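NoteIndexes above is only a declaration. As one way to apply it at startup, sketched under the assumption of a connected MongoDB Db instance (the helper file and name are hypothetical):
// src/utils/ensureIndexes.ts — hypothetical helper, not part of the original code
import type { Db } from 'mongodb'
import { NoteIndexes } from '../models/Note'

export async function ensureNoteIndexes(db: Db) {
  // createIndexes accepts the same { key, unique } shape used in NoteIndexes
  await db.collection('notes').createIndexes([...NoteIndexes])
}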
5. Data Collection Service
The collection service combines rate limiting, proxy and User-Agent rotation, request interception, and retry with concurrency control:
// src/services/scraper/index.ts
import puppeteer, { type Browser } from 'puppeteer'
import type { Note } from '../../models/Note'
import { config } from '../../config/config'
import { Logger } from '../../utils/logger'
import { retry } from '../../utils/retry'
import { sanitize } from '../../utils/sanitizer'
import { validate } from '../../utils/validator'
import { ProxyManager } from '../proxy'
import { RateLimiter } from '../ratelimit'
import { UserAgentManager } from '../useragent'

export class Scraper {
  private browser: Browser | null = null
  private proxyManager: ProxyManager
  private uaManager: UserAgentManager
  private rateLimiter: RateLimiter
  private logger: Logger

  constructor() {
    this.proxyManager = new ProxyManager()
    this.uaManager = new UserAgentManager()
    this.rateLimiter = new RateLimiter({
      maxRequests: 100, // at most 100 requests...
      perMinute: 1 // ...per one-minute window
    })
    this.logger = new Logger('scraper')
  }

  async initialize() {
    const proxy = await this.proxyManager.getProxy()
    this.browser = await puppeteer.launch({
      headless: config.scraper.headless,
      args: [
        '--no-sandbox',
        `--proxy-server=${proxy.host}:${proxy.port}`,
        '--disable-dev-shm-usage',
        '--disable-gpu'
      ],
      defaultViewport: {
        width: 1920,
        height: 1080
      }
    })
  }
  async scrapeNote(url: string): Promise<Note> {
    // Rate limiting
    await this.rateLimiter.acquire()

    if (!this.browser)
      throw new Error('Browser not initialized')

    const page = await this.browser.newPage()
    try {
      // Intercept requests to skip image downloads
      await page.setRequestInterception(true)
      page.on('request', (request) => {
        if (request.resourceType() === 'image') {
          request.abort()
        }
        else {
          request.continue()
        }
      })

      // Set a rotating User-Agent
      await page.setUserAgent(this.uaManager.getRandomUA())

      // Set the navigation timeout
      await page.setDefaultNavigationTimeout(config.scraper.timeout)

      // Open the note page
      await page.goto(url, {
        waitUntil: 'networkidle0',
        timeout: config.scraper.timeout
      })

      // Wait for the content to render
      await page.waitForSelector('.note-content', {
        timeout: config.scraper.timeout
      })

      // Extract data from the DOM
      const noteData = await page.evaluate(() => {
        const title = document.querySelector('.title')?.textContent ?? ''
        const content = document.querySelector('.content')?.textContent ?? ''
        const likes = document.querySelector('.likes')?.textContent
        const comments = document.querySelector('.comments')?.textContent
        const shares = document.querySelector('.shares')?.textContent
        const author = {
          id: document.querySelector('.author-id')?.getAttribute('data-id'),
          name: document.querySelector('.author-name')?.textContent,
          followers: document.querySelector('.followers')?.textContent
        }
        const tags = Array.from(
          document.querySelectorAll('.tag')
        ).map(tag => tag.textContent)
        const images = Array.from(
          document.querySelectorAll('.note-image')
        ).map(img => img.getAttribute('src'))

        return {
          title,
          content,
          likes: Number.parseInt(likes || '0', 10),
          comments: Number.parseInt(comments || '0', 10),
          shares: Number.parseInt(shares || '0', 10),
          author: {
            id: author.id || '',
            name: author.name || '',
            followers: Number.parseInt(author.followers || '0', 10)
          },
          tags: tags.filter(Boolean) as string[],
          images: images.filter(Boolean) as string[]
        }
      })

      // Clean the data
      const cleanData = sanitize(noteData)
      // Validate the data
      const validData = validate(cleanData)

      return {
        noteId: url.split('/').pop() || '',
        ...validData,
        createdAt: new Date(),
        scrapedAt: new Date()
      }
    }
    catch (error) {
      this.logger.error('Scrape failed:', error)
      throw error
    }
    finally {
      await page.close()
    }
  }
  // Batch scraping
  async scrapeNotes(urls: string[]): Promise<Note[]> {
    const concurrency = config.scraper.maxConcurrency
    const results: Note[] = []

    // Concurrency control: process URLs in fixed-size batches
    for (let i = 0; i < urls.length; i += concurrency) {
      const batch = urls.slice(i, i + concurrency)
      const promises = batch.map(url =>
        retry(() => this.scrapeNote(url), {
          retries: 3,
          minTimeout: 1000,
          maxTimeout: 5000
        })
      )

      const batchResults = await Promise.allSettled(promises)
      batchResults.forEach((result, index) => {
        if (result.status === 'fulfilled') {
          results.push(result.value)
        }
        else {
          this.logger.error(
            `Failed to scrape ${batch[index]}:`,
            result.reason
          )
        }
      })

      // Random delay between batches
      await new Promise(resolve =>
        setTimeout(resolve, Math.random() * 2000 + 1000)
      )
    }

    return results
  }

  async close() {
    if (this.browser) {
      await this.browser.close()
      this.browser = null
    }
  }
}
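The RateLimiter consumed by the scraper is imported but never shown. A minimal sliding-window sketch that matches the acquire() call and the { maxRequests, perMinute } options used above; the implementation details are assumptions, not the original author's code:
// src/services/ratelimit/index.ts — a minimal sketch, assuming a sliding-window design
export interface RateLimiterOptions {
  maxRequests: number // how many requests are allowed...
  perMinute: number // ...within this many minutes
}

export class RateLimiter {
  private timestamps: number[] = []

  constructor(private options: RateLimiterOptions) {}

  // Resolves once a new request fits inside the sliding window
  async acquire(): Promise<void> {
    const windowMs = this.options.perMinute * 60_000
    for (;;) {
      const now = Date.now()
      // Drop timestamps that have left the window
      this.timestamps = this.timestamps.filter(t => now - t < windowMs)
      if (this.timestamps.length < this.options.maxRequests) {
        this.timestamps.push(now)
        return
      }
      // Sleep until the oldest timestamp expires
      await new Promise(r => setTimeout(r, this.timestamps[0] + windowMs - now))
    }
  }
}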
6. Data Analysis Engine
The analysis engine evaluates notes along three dimensions: engagement, content, and author profile:
// src/services/analyzer/index.ts
import type { Note } from '../../models/Note'
import { Logger } from '../../utils/logger'
import { ImageAnalyzer } from '../image'
import { NLPService } from '../nlp'
import { StatisticsService } from '../statistics'

export class Analyzer {
  private nlp: NLPService
  private imageAnalyzer: ImageAnalyzer
  private statsService: StatisticsService
  private logger: Logger

  constructor() {
    this.nlp = new NLPService()
    this.imageAnalyzer = new ImageAnalyzer()
    this.statsService = new StatisticsService()
    this.logger = new Logger('analyzer')
  }
  // Engagement analysis
  async analyzeEngagement(note: Note) {
    try {
      const totalEngagement = note.likes + note.comments + note.shares
      const safeTotal = Math.max(totalEngagement, 1) // guard against division by zero
      const engagement = {
        total: totalEngagement,
        likeRatio: note.likes / safeTotal,
        commentRatio: note.comments / safeTotal,
        shareRatio: note.shares / safeTotal,
        // Engagement rate relative to audience size
        engagementRate: totalEngagement / Math.max(note.author.followers, 1),
        // Time dimensions
        hourOfDay: new Date(note.createdAt).getHours(),
        dayOfWeek: new Date(note.createdAt).getDay(),
        // Classification
        isViral: totalEngagement > 10000,
        hasHighEngagement: totalEngagement > 5000,
        // Engagement quality
        qualityScore: this.calculateQualityScore(note)
      }

      // Compare against historical data
      const historical = await this.statsService.getHistoricalStats(note.author.id)

      return {
        ...engagement,
        // Relative performance
        performanceVsAvg: totalEngagement / historical.avgEngagement,
        trend: this.analyzeTrend(historical.engagements)
      }
    }
    catch (error) {
      this.logger.error('Engagement analysis failed:', error)
      throw error
    }
  }
  // Content analysis
  async analyzeContent(note: Note) {
    try {
      // Text analysis
      const textAnalysis = await this.nlp.analyze(note.content)

      // Image analysis
      const imageAnalysis = await Promise.all(
        note.images.map(img => this.imageAnalyzer.analyze(img))
      )

      return {
        text: {
          length: note.content.length,
          sentiment: textAnalysis.sentiment,
          keywords: textAnalysis.keywords,
          topics: textAnalysis.topics,
          readability: textAnalysis.readability,
          emoticons: textAnalysis.emoticons,
          language: textAnalysis.language
        },
        images: {
          count: note.images.length,
          types: imageAnalysis.map(img => img.type),
          colors: imageAnalysis.map(img => img.dominantColors),
          objects: imageAnalysis.map(img => img.detectedObjects),
          quality: imageAnalysis.map(img => img.qualityScore)
        },
        tags: {
          count: note.tags.length,
          categories: this.categorizeTags(note.tags),
          popularity: await this.statsService.getTagsPopularity(note.tags)
        },
        products: note.products?.map(product => ({
          ...product,
          category: this.categorizeProduct(product),
          priceRange: this.getPriceRange(product.price)
        }))
      }
    }
    catch (error) {
      this.logger.error('Content analysis failed:', error)
      throw error
    }
  }
  // Author profile analysis
  async analyzeAuthor(note: Note) {
    try {
      const authorStats = await this.statsService.getAuthorStats(note.author.id)

      return {
        profile: {
          followers: note.author.followers,
          followersGrowth: authorStats.followersGrowth,
          engagementRate: authorStats.avgEngagement / Math.max(note.author.followers, 1),
          postFrequency: authorStats.postFrequency,
          activeTime: authorStats.activeTimeDistribution
        },
        content: {
          topCategories: authorStats.topCategories,
          commonTags: authorStats.commonTags,
          writingStyle: await this.nlp.analyzeStyle(authorStats.recentPosts),
          imageStyle: await this.imageAnalyzer.analyzeStyle(authorStats.recentImages)
        },
        influence: {
          score: this.calculateInfluenceScore(authorStats),
          reach: this.calculateReach(authorStats),
          niche: this.identifyNiche(authorStats)
        }
      }
    }
    catch (error) {
      this.logger.error('Author analysis failed:', error)
      throw error
    }
  }
  private calculateQualityScore(note: Note): number {
    // Quality score calculation goes here
    return 0
  }

  private analyzeTrend(engagements: number[]): string {
    // Trend analysis logic goes here
    return 'up'
  }

  private categorizeTags(tags: string[]): Record<string, string[]> {
    // Tag categorization logic goes here
    return {}
  }

  private categorizeProduct(product: any): string {
    // Product categorization logic goes here
    return ''
  }

  private getPriceRange(price?: number): string {
    // Price range bucketing logic goes here
    return ''
  }

  private calculateInfluenceScore(stats: any): number {
    // Influence score calculation goes here
    return 0
  }

  private calculateReach(stats: any): number {
    // Reach estimation logic goes here
    return 0
  }

  private identifyNiche(stats: any): string {
    // Niche identification logic goes here
    return ''
  }
}
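The private helpers above are intentionally left as stubs. As one illustrative possibility (not the original author's formula), a quality score could weight comments and shares more heavily than likes and normalize by audience size:
// src/services/analyzer/quality.ts — illustrative scoring only; the original logic is undefined
import type { Note } from '../../models/Note'

export function sampleQualityScore(note: Note): number {
  // Weight comments and shares above likes, since they cost the viewer more effort
  const weighted = note.likes + note.comments * 3 + note.shares * 5
  // Normalize by audience size, guarding against zero followers
  const base = Math.max(note.author.followers, 1)
  // Soft-cap the score to the 0-100 range
  return Math.min(100, (weighted / base) * 100)
}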
7. Monitoring and Alert System
Implement comprehensive system monitoring:
// src/services/monitor/index.ts
import { EventEmitter } from 'node:events'
import { Logger } from '../../utils/logger'
import { AlertManager } from './alert'
import { Metrics } from './metrics'

export class MonitoringSystem extends EventEmitter {
  private metrics: Metrics
  private alertManager: AlertManager
  private logger: Logger

  constructor() {
    super()
    this.metrics = new Metrics()
    this.alertManager = new AlertManager()
    this.logger = new Logger('monitor')
    this.setupEventHandlers()
  }

  private setupEventHandlers() {
    // Performance monitoring
    this.on('request', this.handleRequest.bind(this))
    this.on('error', this.handleError.bind(this))
    this.on('scrapeComplete', this.handleScrapeComplete.bind(this))

    // Resource monitoring
    this.on('cpuUsage', this.handleCPUUsage.bind(this))
    this.on('memoryUsage', this.handleMemoryUsage.bind(this))
    this.on('diskUsage', this.handleDiskUsage.bind(this))

    // Business monitoring
    this.on('dataQuality', this.handleDataQuality.bind(this))
    this.on('scrapeFail', this.handleScrapeFail.bind(this))
    this.on('proxyFail', this.handleProxyFail.bind(this))
  }

  // Handle request events
  private async handleRequest(data: any) {
    try {
      await this.metrics.recordRequest(data)

      if (data.duration > 5000) {
        await this.alertManager.sendAlert({
          level: 'warning',
          title: 'Slow request',
          message: `Request ${data.url} took ${data.duration}ms`
        })
      }
    }
    catch (error) {
      this.logger.error('Failed to handle request event:', error)
    }
  }

  // Handle error events
  private async handleError(error: any) {
    try {
      await this.metrics.recordError(error)

      await this.alertManager.sendAlert({
        level: 'error',
        title: 'System error',
        message: error.message,
        stack: error.stack
      })
    }
    catch (err) {
      this.logger.error('Failed to handle error event:', err)
    }
  }

  // Handle scrape-complete events
  private async handleScrapeComplete(data: any) {
    try {
      await this.metrics.recordScrape(data)

      if (data.failureRate > 0.1) {
        await this.alertManager.sendAlert({
          level: 'warning',
          title: 'High scrape failure rate',
          message: `Current scrape failure rate is ${data.failureRate}`
        })
      }
    }
    catch (error) {
      this.logger.error('Failed to handle scrape complete event:', error)
    }
  }

  // ... other event handlers
}
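The AlertManager imported above is never shown either. A minimal sketch that posts alerts to a webhook and falls back to the console; the ALERT_WEBHOOK_URL variable and the Alert shape are assumptions inferred from the sendAlert calls:
// src/services/monitor/alert.ts — a minimal sketch; delivery channel is an assumption
interface Alert {
  level: 'info' | 'warning' | 'error'
  title: string
  message: string
  stack?: string
}

export class AlertManager {
  private webhookUrl = process.env.ALERT_WEBHOOK_URL

  async sendAlert(alert: Alert): Promise<void> {
    if (!this.webhookUrl) {
      // No channel configured: log locally instead of dropping the alert
      console.warn(`[${alert.level}] ${alert.title}: ${alert.message}`)
      return
    }
    // Node 18+ ships a global fetch
    await fetch(this.webhookUrl, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(alert)
    })
  }
}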
Usage Guide
1. Start Service
npm run dev
2. API Call
curl -X POST http://localhost:3000/api/scrape \
-H "Content-Type: application/json" \
-d '{"url":"https://www.xiaohongshu.com/note/..."}'
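The tutorial never shows the Express entry point this call hits. A minimal sketch wiring the Scraper from section 5 into the /api/scrape route, with validation and error handling kept deliberately simple (the route body is inferred from the curl example, not from original code):
// src/index.ts — a minimal sketch of the API entry point
import express from 'express'
import { config } from './config/config'
import { Scraper } from './services/scraper'

const app = express()
app.use(express.json())

const scraper = new Scraper()

app.post('/api/scrape', async (req, res) => {
  const { url } = req.body
  if (typeof url !== 'string') {
    res.status(400).json({ error: 'url is required' })
    return
  }
  try {
    const note = await scraper.scrapeNote(url)
    res.json(note)
  }
  catch (error) {
    res.status(500).json({ error: (error as Error).message })
  }
})

// Launch the browser once, then start accepting requests
scraper.initialize().then(() => {
  app.listen(config.api.port, () => {
    console.log(`API listening on port ${config.api.port}`)
  })
})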
System Optimization
1. Anti-Scraping Strategy
To keep data collection stable, apply the following countermeasures (a User-Agent rotation sketch follows the delay helper below):
- Dynamic delay control
- User-Agent rotation
- IP proxy pool
- Intelligent CAPTCHA handling
// Intelligent delay control
async function randomDelay() {
  const delay = Math.floor(Math.random() * 2000) + 1000
  await new Promise(resolve => setTimeout(resolve, delay))
}
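The scraper imports a UserAgentManager that the tutorial never defines. A minimal sketch matching the getRandomUA() call, assuming a static pool of desktop UA strings (the pool contents are illustrative):
// src/services/useragent/index.ts — a minimal sketch; pool contents are illustrative
export class UserAgentManager {
  private pool = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0 Safari/537.36'
  ]

  // Returns a random UA string from the pool on each call
  getRandomUA(): string {
    return this.pool[Math.floor(Math.random() * this.pool.length)]
  }
}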
2. Data Backup
Implement automated data backup mechanism:
// src/utils/backup.ts
import { writeFileSync } from 'node:fs'
import { MongoClient } from 'mongodb'
import { config } from '../config/config'

export async function backupData() {
  const client = await MongoClient.connect(config.mongodb.url)
  const db = client.db(config.mongodb.dbName)

  const notes = await db.collection('notes').find().toArray()
  writeFileSync(
    `./data/backup-${Date.now()}.json`,
    JSON.stringify(notes, null, 2)
  )

  await client.close()
}
3. Fault Tolerance
Implement intelligent retry strategy:
async function withRetry<T>(
  fn: () => Promise<T>,
  retries = 3
): Promise<T> {
  try {
    return await fn()
  }
  catch (error) {
    if (retries > 0) {
      await randomDelay()
      return withRetry(fn, retries - 1)
    }
    throw error
  }
}
Advanced Features
1. Data Visualization
Build analysis dashboard based on React and Chart.js:
// frontend/src/components/Dashboard.tsx
import { CategoryScale, Chart as ChartJS, LinearScale, LineElement, PointElement } from 'chart.js'
import { Line } from 'react-chartjs-2'

// Chart.js v3+ requires registering the pieces a chart uses
ChartJS.register(CategoryScale, LinearScale, PointElement, LineElement)

interface DashboardProps {
  data: Array<{ date: string, engagement: { total: number } }>
}

export function Dashboard({ data }: DashboardProps) {
  const chartData = {
    labels: data.map(d => d.date),
    datasets: [{
      label: 'Engagement trend',
      data: data.map(d => d.engagement.total)
    }]
  }
  return (
    <div className="dashboard">
      <Line data={chartData} />
    </div>
  )
}
2. Scheduled Tasks
Integrate automated collection tasks:
import cron from 'node-cron'
import { Scraper } from './services/scraper'

const scraper = new Scraper()

// Execute at 2 AM daily
cron.schedule('0 2 * * *', async () => {
  try {
    await scraper.initialize()
    // Execute collection task
    await scraper.close()
  }
  catch (error) {
    console.error('Cron job failed:', error)
  }
})
Common Issues and Solutions
1. Collection Failures
- Check network connectivity and proxy health
- Validate URL format before queueing
- Detect and handle anti-scraping responses (CAPTCHAs, blocks)
2. Data Integrity
- Tune load timeouts for slow pages
- Keep CSS selectors up to date as the site changes
- Validate page structure before extraction
3. Performance Tuning (see the connection-pool sketch below)
- Use a database connection pool
- Add a caching layer for repeated queries
- Optimize queries against the indexes defined earlier
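For the connection-pool item above, a hedged sketch of a shared MongoClient with explicit pool bounds (the file name and pool sizes are illustrative, not from the original):
// src/utils/db.ts — a sketch of a shared client with explicit pool bounds
import { MongoClient } from 'mongodb'
import { config } from '../config/config'

// One client per process; the driver manages the socket pool internally
export const client = new MongoClient(config.mongodb.url, {
  maxPoolSize: 20,
  minPoolSize: 2
})

export async function getDb() {
  await client.connect() // safe to call repeatedly on recent driver versions
  return client.db(config.mongodb.dbName)
}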
Best Practices
- Follow the platform's terms of service
- Keep request frequency within reasonable limits
- Maintain selectors and dependencies regularly
Important Reminders
- Protect collected data and restrict access to it
- Ensure compliance with applicable laws and regulations
- Plan for the platform's risk-control responses