小红书爆款笔记数据采集与分析
项目简介
本教程将指导你开发一个专业的小红书数据采集分析工具。该工具具备以下核心功能:
- 自动化采集爆款笔记数据
- 深度分析爆款内容特征
- 数据导出与可视化
- 自动生成分析报告
技术方案
为了实现高效稳定的数据采集和分析,我们采用以下技术栈:
后端技术:
- Node.js 运行环境
- TypeScript 类型系统
- Puppeteer 自动化
- Express 接口服务
- MongoDB 数据存储
前端展示:
- React 组件化开发
- Chart.js 数据可视化
开发流程
1. 环境准备
首先创建项目并完成依赖安装:
bash
mkdir redbook-scraper
cd redbook-scraper
npm init -y
npm install typescript ts-node @types/node puppeteer express mongodb
npm install -D nodemon @types/express
1
2
3
4
5
2
3
4
5
2. 目录结构设计
采用模块化的目录结构组织代码:
redbook-scraper/
├── src/
│ ├── config/ # 配置文件
│ ├── models/ # 数据模型
│ ├── services/ # 核心服务
│ ├── utils/ # 工具函数
│ └── index.ts # 入口文件
├── data/ # 数据存储
├── logs/ # 日志文件
└── tsconfig.json # TS配置
1
2
3
4
5
6
7
8
9
10
2
3
4
5
6
7
8
9
10
3. 系统配置
typescript
// src/config/config.ts
export const config = {
mongodb: {
url: process.env.MONGODB_URL || 'mongodb://localhost:27017/redbook',
dbName: 'redbook'
},
scraper: {
headless: true,
timeout: 30000,
userAgent: 'Mozilla/5.0 ...'
},
api: {
port: process.env.PORT || 3000
}
}
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
2
3
4
5
6
7
8
9
10
11
12
13
14
15
4. 数据建模
数据模型设计采用模块化方案,包含笔记、用户、统计等多个模型:
typescript
// src/models/Note.ts
import { ObjectId } from 'mongodb'
import { z } from 'zod' // 使用 Zod 进行运行时类型验证
// 笔记模型验证schema
const NoteSchema = z.object({
_id: z.instanceof(ObjectId).optional(),
noteId: z.string().min(1),
title: z.string().min(1).max(100),
content: z.string().min(1),
images: z.array(z.string().url()),
likes: z.number().int().min(0),
comments: z.number().int().min(0),
shares: z.number().int().min(0),
author: z.object({
id: z.string(),
name: z.string(),
followers: z.number().int().min(0)
}),
tags: z.array(z.string()),
createdAt: z.date(),
scrapedAt: z.date(),
// 新增字段
location: z.string().optional(),
device: z.string().optional(),
topics: z.array(z.string()).optional(),
mentions: z.array(z.string()).optional(),
products: z.array(z.object({
id: z.string(),
name: z.string(),
price: z.number().optional()
})).optional()
})
// 导出类型
export type Note = z.infer<typeof NoteSchema>
// 数据库索引设计
export const NoteIndexes = [
{ key: { noteId: 1 }, unique: true },
{ key: { createdAt: -1 } },
{ key: { 'author.id': 1 } },
{ key: { tags: 1 } },
{ key: { likes: -1 } }
] as const
// 用户模型
export interface User {
_id?: ObjectId
userId: string
nickname: string
avatar: string
followers: number
following: number
notes: number
verified: boolean
verifiedType?: string
description?: string
location?: string
createdAt: Date
updatedAt: Date
}
// 统计模型
export interface Stats {
_id?: ObjectId
date: Date
totalNotes: number
totalUsers: number
avgLikes: number
avgComments: number
avgShares: number
topTags: Array<{
tag: string
count: number
}>
topAuthors: Array<{
userId: string
noteCount: number
totalEngagement: number
}>
updatedAt: Date
}
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
5. 数据采集服务
采集服务实现了多种高级特性:
typescript
import type { Note } from '../../models/Note'
// src/services/scraper/index.ts
import puppeteer from 'puppeteer'
import { config } from '../../config/config'
import { Logger } from '../../utils/logger'
import { retry } from '../../utils/retry'
import { sanitize } from '../../utils/sanitizer'
import { validate } from '../../utils/validator'
import { ProxyManager } from '../proxy'
import { RateLimiter } from '../ratelimit'
import { UserAgentManager } from '../useragent'
export class Scraper {
private browser: puppeteer.Browser | null = null
private proxyManager: ProxyManager
private uaManager: UserAgentManager
private rateLimiter: RateLimiter
private logger: Logger
constructor() {
this.proxyManager = new ProxyManager()
this.uaManager = new UserAgentManager()
this.rateLimiter = new RateLimiter({
maxRequests: 100,
perMinute: 1
})
this.logger = new Logger('scraper')
}
async initialize() {
const proxy = await this.proxyManager.getProxy()
this.browser = await puppeteer.launch({
headless: config.scraper.headless,
args: [
'--no-sandbox',
`--proxy-server=${proxy.host}:${proxy.port}`,
'--disable-dev-shm-usage',
'--disable-gpu'
],
defaultViewport: {
width: 1920,
height: 1080
}
})
}
async scrapeNote(url: string): Promise<Note> {
// 限流控制
await this.rateLimiter.acquire()
if (!this.browser)
throw new Error('Browser not initialized')
const page = await this.browser.newPage()
try {
// 设置请求拦截
await page.setRequestInterception(true)
page.on('request', (request) => {
if (request.resourceType() === 'image') {
request.abort()
}
else {
request.continue()
}
})
// 设置User-Agent
await page.setUserAgent(this.uaManager.getRandomUA())
// 设置超时
await page.setDefaultNavigationTimeout(config.scraper.timeout)
// 访问页面
await page.goto(url, {
waitUntil: 'networkidle0',
timeout: config.scraper.timeout
})
// 等待内容加载
await page.waitForSelector('.note-content', {
timeout: config.scraper.timeout
})
// 提取数据
const noteData = await page.evaluate(() => {
const title = document.querySelector('.title')?.textContent
const content = document.querySelector('.content')?.textContent
const likes = document.querySelector('.likes')?.textContent
const comments = document.querySelector('.comments')?.textContent
const shares = document.querySelector('.shares')?.textContent
const author = {
id: document.querySelector('.author-id')?.getAttribute('data-id'),
name: document.querySelector('.author-name')?.textContent,
followers: document.querySelector('.followers')?.textContent
}
const tags = Array.from(
document.querySelectorAll('.tag')
).map(tag => tag.textContent)
const images = Array.from(
document.querySelectorAll('.note-image')
).map(img => img.getAttribute('src'))
return {
title,
content,
likes: Number.parseInt(likes || '0'),
comments: Number.parseInt(comments || '0'),
shares: Number.parseInt(shares || '0'),
author: {
id: author.id || '',
name: author.name || '',
followers: Number.parseInt(author.followers || '0')
},
tags: tags.filter(Boolean) as string[],
images: images.filter(Boolean) as string[]
}
})
// 数据清洗
const cleanData = sanitize(noteData)
// 数据验证
const validData = validate(cleanData)
return {
noteId: url.split('/').pop() || '',
...validData,
createdAt: new Date(),
scrapedAt: new Date()
}
}
catch (error) {
this.logger.error('Scrape failed:', error)
throw error
}
finally {
await page.close()
}
}
// 批量采集
async scrapeNotes(urls: string[]): Promise<Note[]> {
const concurrency = config.scraper.maxConcurrency
const results: Note[] = []
// 并发控制
for (let i = 0; i < urls.length; i += concurrency) {
const batch = urls.slice(i, i + concurrency)
const promises = batch.map(url =>
retry(() => this.scrapeNote(url), {
retries: 3,
minTimeout: 1000,
maxTimeout: 5000
})
)
const batchResults = await Promise.allSettled(promises)
batchResults.forEach((result, index) => {
if (result.status === 'fulfilled') {
results.push(result.value)
}
else {
this.logger.error(
`Failed to scrape ${batch[index]}:`,
result.reason
)
}
})
// 批次间延迟
await new Promise(resolve =>
setTimeout(resolve, Math.random() * 2000 + 1000)
)
}
return results
}
async close() {
if (this.browser) {
await this.browser.close()
this.browser = null
}
}
}
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
6. 数据分析引擎
数据分析引擎提供多维度的内容分析能力:
typescript
// src/services/analyzer/index.ts
import { Note } from '../../models/Note'
import { Logger } from '../../utils/logger'
import { ImageAnalyzer } from '../image'
import { NLPService } from '../nlp'
import { StatisticsService } from '../statistics'
export class Analyzer {
private nlp: NLPService
private imageAnalyzer: ImageAnalyzer
private statsService: StatisticsService
private logger: Logger
constructor() {
this.nlp = new NLPService()
this.imageAnalyzer = new ImageAnalyzer()
this.statsService = new StatisticsService()
this.logger = new Logger('analyzer')
}
// 互动数据分析
async analyzeEngagement(note: Note) {
try {
const totalEngagement = note.likes + note.comments + note.shares
const engagement = {
total: totalEngagement,
likeRatio: note.likes / totalEngagement,
commentRatio: note.comments / totalEngagement,
shareRatio: note.shares / totalEngagement,
// 互动率计算
engagementRate: totalEngagement / note.author.followers,
// 时间维度
hourOfDay: new Date(note.createdAt).getHours(),
dayOfWeek: new Date(note.createdAt).getDay(),
// 分类
isViral: totalEngagement > 10000,
hasHighEngagement: totalEngagement > 5000,
// 互动质量
qualityScore: this.calculateQualityScore(note)
}
// 历史数据对比
const historical = await this.statsService.getHistoricalStats(note.author.id)
return {
...engagement,
// 相对表现
performanceVsAvg: totalEngagement / historical.avgEngagement,
trend: this.analyzeTrend(historical.engagements)
}
}
catch (error) {
this.logger.error('Engagement analysis failed:', error)
throw error
}
}
// 内容分析
async analyzeContent(note: Note) {
try {
// 文本分析
const textAnalysis = await this.nlp.analyze(note.content)
// 图片分析
const imageAnalysis = await Promise.all(
note.images.map(img => this.imageAnalyzer.analyze(img))
)
return {
text: {
length: note.content.length,
sentiment: textAnalysis.sentiment,
keywords: textAnalysis.keywords,
topics: textAnalysis.topics,
readability: textAnalysis.readability,
emoticons: textAnalysis.emoticons,
language: textAnalysis.language
},
images: {
count: note.images.length,
types: imageAnalysis.map(img => img.type),
colors: imageAnalysis.map(img => img.dominantColors),
objects: imageAnalysis.map(img => img.detectedObjects),
quality: imageAnalysis.map(img => img.qualityScore)
},
tags: {
count: note.tags.length,
categories: this.categorizeTags(note.tags),
popularity: await this.statsService.getTagsPopularity(note.tags)
},
products: note.products?.map(product => ({
...product,
category: this.categorizeProduct(product),
priceRange: this.getPriceRange(product.price)
}))
}
}
catch (error) {
this.logger.error('Content analysis failed:', error)
throw error
}
}
// 用户画像分析
async analyzeAuthor(note: Note) {
try {
const authorStats = await this.statsService.getAuthorStats(note.author.id)
return {
profile: {
followers: note.author.followers,
followersGrowth: authorStats.followersGrowth,
engagementRate: authorStats.avgEngagement / note.author.followers,
postFrequency: authorStats.postFrequency,
activeTime: authorStats.activeTimeDistribution
},
content: {
topCategories: authorStats.topCategories,
commonTags: authorStats.commonTags,
writingStyle: await this.nlp.analyzeStyle(authorStats.recentPosts),
imageStyle: await this.imageAnalyzer.analyzeStyle(authorStats.recentImages)
},
influence: {
score: this.calculateInfluenceScore(authorStats),
reach: this.calculateReach(authorStats),
niche: this.identifyNiche(authorStats)
}
}
}
catch (error) {
this.logger.error('Author analysis failed:', error)
throw error
}
}
private calculateQualityScore(note: Note): number {
// 实现质量分数计算逻辑
return 0
}
private analyzeTrend(engagements: number[]): string {
// 实现趋势分析逻辑
return 'up'
}
private categorizeTags(tags: string[]): Record<string, string[]> {
// 实现标签分类逻辑
return {}
}
private categorizeProduct(product: any): string {
// 实现商品分类逻辑
return ''
}
private getPriceRange(price?: number): string {
// 实现价格区间划分逻辑
return ''
}
private calculateInfluenceScore(stats: any): number {
// 实现影响力分数计算逻辑
return 0
}
private calculateReach(stats: any): number {
// 实现触达范围计算逻辑
return 0
}
private identifyNiche(stats: any): string {
// 实现领域识别逻辑
return ''
}
}
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
7. 监控告警系统
实现全方位的系统监控:
typescript
// src/services/monitor/index.ts
import { EventEmitter } from 'node:events'
import { Logger } from '../../utils/logger'
import { AlertManager } from './alert'
import { Metrics } from './metrics'
export class MonitoringSystem extends EventEmitter {
private metrics: Metrics
private alertManager: AlertManager
private logger: Logger
constructor() {
super()
this.metrics = new Metrics()
this.alertManager = new AlertManager()
this.logger = new Logger('monitor')
this.setupEventHandlers()
}
private setupEventHandlers() {
// 性能监控
this.on('request', this.handleRequest.bind(this))
this.on('error', this.handleError.bind(this))
this.on('scrapeComplete', this.handleScrapeComplete.bind(this))
// 资源监控
this.on('cpuUsage', this.handleCPUUsage.bind(this))
this.on('memoryUsage', this.handleMemoryUsage.bind(this))
this.on('diskUsage', this.handleDiskUsage.bind(this))
// 业务监控
this.on('dataQuality', this.handleDataQuality.bind(this))
this.on('scrapeFail', this.handleScrapeFail.bind(this))
this.on('proxyFail', this.handleProxyFail.bind(this))
}
// 处理请求事件
private async handleRequest(data: any) {
try {
await this.metrics.recordRequest(data)
if (data.duration > 5000) {
await this.alertManager.sendAlert({
level: 'warning',
title: '请求响应时间过长',
message: `请求 ${data.url} 响应时间为 ${data.duration}ms`
})
}
}
catch (error) {
this.logger.error('Failed to handle request event:', error)
}
}
// 处理错误事件
private async handleError(error: any) {
try {
await this.metrics.recordError(error)
await this.alertManager.sendAlert({
level: 'error',
title: '系统错误',
message: error.message,
stack: error.stack
})
}
catch (err) {
this.logger.error('Failed to handle error event:', err)
}
}
// 处理采集完成事件
private async handleScrapeComplete(data: any) {
try {
await this.metrics.recordScrape(data)
if (data.failureRate > 0.1) {
await this.alertManager.sendAlert({
level: 'warning',
title: '采集失败率过高',
message: `当前采集失败率为 ${data.failureRate}`
})
}
}
catch (error) {
this.logger.error('Failed to handle scrape complete event:', error)
}
}
// ... 其他事件处理方法
}
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
使用指南
1. 启动服务
bash
npm run dev
1
2. 接口调用
bash
curl -X POST http://localhost:3000/api/scrape \
-H "Content-Type: application/json" \
-d '{"url":"https://www.xiaohongshu.com/note/..."}'
1
2
3
2
3
系统优化
1. 反爬策略
为确保数据采集的稳定性,实现以下优化:
- 动态延迟控制
- User Agent 轮换
- IP 代理池
- 验证码智能处理
typescript
// 智能延迟控制
async function randomDelay() {
const delay = Math.floor(Math.random() * 2000) + 1000
await new Promise(resolve => setTimeout(resolve, delay))
}
1
2
3
4
5
2
3
4
5
2. 数据备份
实现自动化数据备份机制:
typescript
import { readFileSync, writeFileSync } from 'node:fs'
// src/utils/backup.ts
import { MongoClient } from 'mongodb'
export async function backupData() {
const client = await MongoClient.connect(config.mongodb.url)
const db = client.db(config.mongodb.dbName)
const notes = await db.collection('notes').find().toArray()
writeFileSync(
`./data/backup-${Date.now()}.json`,
JSON.stringify(notes, null, 2)
)
await client.close()
}
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
3. 容错机制
实现智能重试策略:
typescript
async function withRetry<T>(
fn: () => Promise<T>,
retries = 3
): Promise<T> {
try {
return await fn()
}
catch (error) {
if (retries > 0) {
await randomDelay()
return withRetry(fn, retries - 1)
}
throw error
}
}
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
2
3
4
5
6
7
8
9
10
11
12
13
14
15
高级特性
1. 数据可视化
基于 React 和 Chart.js 构建分析面板:
typescript
// frontend/src/components/Dashboard.tsx
import { Line } from 'react-chartjs-2'
export function Dashboard({ data }) {
const chartData = {
labels: data.map(d => d.date),
datasets: [{
label: '互动量趋势',
data: data.map(d => d.engagement.total)
}]
}
return (
<div className="dashboard">
<Line data={chartData} />
</div>
)
}
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
2. 定时任务
集成自动化采集任务:
typescript
import cron from 'node-cron'
// 每天凌晨 2 点执行
cron.schedule('0 2 * * *', async () => {
try {
await scraper.initialize()
// 执行采集任务
await scraper.close()
}
catch (error) {
console.error('Cron job failed:', error)
}
})
1
2
3
4
5
6
7
8
9
10
11
12
13
2
3
4
5
6
7
8
9
10
11
12
13
常见问题解决
1. 采集异常
- 网络连接检查
- URL 格式验证
- 反爬措施应对
2. 数据完整性
- 加载超时优化
- 选择器维护
- 页面结构验证
3. 性能调优
- 数据库连接池
- 缓存策略
- 查询优化
最佳实践
- 遵循平台规则
- 合理控制频率
- 及时更新维护
重要提醒
- 数据安全保护
- 合规性要求
- 风控策略应对