seerc 2014 Circle of digits (二分+后缀数组)

题目链接

题意:把一个长度为n的只由数字构成的串分成k个不为空的字串,使得最大的串最小(大小是说串所对应的十进制数的大小)

思路:由于长度为x的串肯定大于长度为x-1的串,因此很容易想到,我们要尽可能使得k组串的长度尽可能平均(避免出现某一个串的长度非常大的情况)

我们可以知道,最大值的串的长度一定为 LEN=(n+k-1)/k;

而每一组的长度,只可能是LEN或者LEN-1。

然后build_sa

注意循环串的几个地方记得%n

接下来二分sa数组的下标。

二分check的时候,先枚举断点,断环为链。

由于每部分最长的长度为LEN,所以0..LEN-1中一定存在一个断点。

然后贪心,尽可能取LEN

根据rk值来决定某一段的长度是LEN还是LEN-1(如果rk值比当前的大,那么就只能取LEN-1,否则取LEN)

如果此时k段的长度之和超过了n,说明此时的最大值还可能更小。

于是继续二分区间的前一半。

 

 

codeforces 123 D. String (后缀数组+两次二分得到区间+rmq)

题目链接

题意:定义一个函数F..

For exampe: F(babbabbababbab, babb) = 6. The list of pairs is as follows:

(1, 4), (4, 7), (9, 12)

Its continuous sequences are:

  • (1, 4)

  • (4, 7)

  • (9, 12)

  • (1, 4), (4, 7)

  • (4, 7), (9, 12)

  • (1, 4), (4, 7), (9, 12)

    erfen


erfen
题目描述得很烂..看例子把..大概就是:如果字符串y在字符串x中出现n次,那么F(x,y)=n*(n+1)/2

现在给一个字符串,求所有的F(s,x)的和,x为字符串的所有不相同的子串.

 

思路:由于刚刚写了一个求一个字符串所有不同子串个数的题目...于是就想到了后缀数组...

写完之后观察height[i].如果把height[i]看成底在x轴上的第i个矩形的高的话,n就是一段连续的矩形的长度.

 

然后...暴力会tle 48

题解说单调栈...但是单调栈之后还要线段树or并查集? (by 羊神)

...不会啊orz

最后用了二分+rmp过掉的

大概就是两次二分得到一个矩形的区间,和whust2016 #1的那道题有点像.

poj 2406 Power Strings (后缀数组||kmp)

poj 2406

题意:给定一个字符串 L,已知这个字符串是由某个字符串 S 重复 R 次而得到的,
求 R 的最大值

思路:论文题.

转载论文中的题解:

做法比较简单,穷举字符串 S 的长度 k,然后判断是否满足。判断的时候,
先看字符串 L 的长度能否被 k 整除,再看 suffix(1)和 suffix(k+1)的最长公共
前缀是否等于 n-k。在询问最长公共前缀的时候,suffix(1)是固定的,所以 RMQ
问 题 没 有 必 要 做 所 有 的 预 处 理 , 只 需 求 出 height 数 组 中 的 每 一 个 数 到
height[rank[1]]之间的最小值即可。整个做法的时间复杂度为 O(n)

最关键的在加黑的那句话:看 suffix(1)和 suffix(k+1)的最长公共
前缀是否等于 n-k

why???

转载一段证明:

为什么这样就行了呢?这要充分考虑到后缀的性质。当lcp1k+1=n-k时,后缀k+1是后缀1(即整个字符串)的一个前缀。(因为后缀k+1的长度为n-k)那么,后缀1的前k个字符必然和后缀k+1的前k个字符对应相同。而后缀1的第k+1..2k个字符,又相当于后缀k+1的前k个字符,所以与后缀1的前k个字符对应相同,且和后缀kk+1..2k又对应相同。依次类推,只要lcp(1,k+1)=n-k,那么s[1..k]就可以通过自复制n/k次得到整个字符串。找出k的最小值,就可以得到n/k的最大值了。

虽然这道题不适合用后缀数组做,倍增会tle,dc3也是卡时间才能过,但是接触到了一个想法.

要看一个字符串s能否由一个较小的长度为k的字符串t重复若干次得到,除了要整除以外,

gengxin判断suffix(1)和 suffix(k+1)的最长公共前缀是否等于 n-k即可.

 

下面是用倍增写的tle了的代码,价值在于那段没有用rmq,而是o(n)更新height数组到height[rk[0]]之间的最小值要怎么写.

 

 

 

 

 

kmp的思路:

KMP中的get_next(),或者get_nextval(),对next数组的应用。next[len]是最后一个字符跳的步长,如果他有相同字符串,则该串长度是len-next[len](这点我还在想要怎么证明!)…如果整个长度len能分解成x个这种串(能整除),就得到ans了。否则不能分解。只能是由他自己组成串,长度为1。

这道题用kmp非常块…

但是其实…………

其实………..

 

 

 

 

其实我不会kmp………………….

搞完后缀数组去学下.

 

spoj SUBST1 – New Distinct Substrings(后缀数组)

题目连接

题意:求所有不同的子串个数。

思路:后缀数组。和上一道题一样,就是数据范围变成了 5E4…1A

 

 

 

 

spoj DISUBSTR – Distinct Substrings (统计字串个数,后缀数组)

题目链接

题意:给出一个字符串,问所有不同的字串的个数。

思路:直接求比较困难。我们考虑,假如组成字符串的所有字符都不相同,那么就没有相同的字串,假设字符串的长度为n,那么长度为1的子串有n个,为2的有n-1个。。。为n的有1个,一共就是n*(n+1)/2个。。但是实际上会有重复的。。。

我们再次考虑这张图。Selection_004

先找一个字符重复的个数,对应height[i]数组就是找height[i]大于等于1个的个数(因为x个height代表了x+1个后缀,保留1个,重复了x个,所以重复的个数恰好和符合条件的height数组对应

接着找大于等于2的个数,大于等于3的个数…

最后再把所有答案累加起来,就是总共重复的次数。

然后按照我推出的这个结论,试着写了一发。。。1A蛤蛤蛤。。。

能想到这里大概是因为之前的题目让我得出了,“height数组是个小妖精”的结论,所以入手就先观察了一下height数组。。。

具体的实现呢,就是先统计height[i]中每个值出现的次数,然后做一个后缀和,最后累加。

 

 

 

poj 3261 Milk Patterns (最长公共子串,后缀数组)

poj3261

题意:给一个字符串,要求找出至少出现k次的最长重复子串…

思路:后缀数组,然后再次用到了根据height数组对后缀进行分组的套路…二分判定合法性,对于当前的最长长度x,分组使得每组中的height[i]都大于等于x,所不同的是,判定变成了存在一个组,后缀的个数至少为k个(因为这样,就可以对于大于等于k个的后缀,同时取前x长度,得到的就是出现了至少k次且长度为x的前缀)1A,蛤蛤蛤

 

poj 1743 Musical Theme (不可重叠最长重复子串,后缀数组)

poj 1743

题意:n个数字(1..88)表示的音符,问最长的连续两段长度至少为5的变化相同的音符段的长度。。。

思路:求最长重复字串。。。。很容易想到后缀数组。。但是这道题多了一个不可重叠的要求。

先二分答案,把题目变成判定性问题:判断是否
存在两个长度为 k 的子串是相同的,且不重叠。解决这个问题的关键还是利用
height 数组。把排序后的后缀分成若干组,其中每组的后缀之间的 height 值都
不小于 k。例如,字符串为“aabaaaab”,当 k=2 时,后缀分成了 4 组,如图 5
所示。

Selection_004

容易看出,有希望成为最长公共前缀不小于 k 的两个后缀一定在同一组。然
后对于每组后缀,只须判断每个后缀的 sa 值的最大值和最小值之差是否不小于
k。如果有一组满足,则说明存在,否则不存在。整个做法的时间复杂度为
O(nlogn)。本题中利用 height 值对后缀进行分组的方法很常用,请读者认真体
会。

这是论文中的题目。这个做法的确想不到,不过很好理解。

如果没有不允许重叠的条件就变成了求所有height[i]中的最大值,而每个height[i]对应的两个后缀的位置是sa[i]和sa[i-1]。

分组使得每组中的height[i]都大于等于k(那height[i]小于k的去哪里了? 因为height[i]是由两个相邻的后缀得到的,如果某两个后缀的height[i]小于k,只需要将这两个后缀分成两组,这样这个height[i]就不存在了,从而保证了每组中的height[i]都大于等于k)

而我们知道,任意两个后缀的最长公共前缀是他们之间的所有height的最小值。因为对于处于同一组内的两个后缀来说,由于之前保证了每组中的height[i]>=k,也就是保证了任意两个后缀的最长公共前缀大于等于k.

因此用二分判定长度k的时候,这样分组以后,只需要再判断是否相交(也就是如果长度k不满足,可能是因为没有办法分组使得每个height都大于等于k,也可能是存在这样的分组,但是两个后缀相交)。

判断相交其实非常简单,sa[i]表示的是排名第i的后缀的开始位置,那么如果存在sa[j]-sa[i]>=k(其实是sa[i]+k-1<sa[j],sa[i]位置开始的后缀的长度为k的前缀的最后一个字符的所在位置sa[i]+k-1比sa[j]小,就说明不相交,由于是整数,就可以变成sa[i]+k<=sa[j],也就是sa[j]-sa[i]>-=k)

而某一组内只要有一组i,j,满足sa[j]-sa[i]>=k就是有解,因此我们只需要判断最可能符合条件的一组,也就是找到一组中sa[i]的最大值和最小值,也正因为我们这样,我们在具体实现的过程中也没必要真的模拟分组的过程,只需要一直更新两个极值即可。

 

以及:lrj的板子是错的!!会re!!!! 已改正。

其他细节见代码注释

 

 

 

 

 

 

ural 1517. Freedom of Choice (后缀数组,最长公共子串)

ural1517
题意:给出两个字符串,求最长的公共字串(要求出具体的字符串是什么)

思路:依然是后缀数组,在更新长度 的时候记录起始位置即可,1A。以及,发现多开了一个完全没有必要的数组w[],这次已删。

 

20160730update:模板已更正,lrj的模板的rk[i]为0 的时候会出现re的问题…已特判。

poj 2774 Long Long Message (最长公共字串,后缀数组模板题)

poj2774

题意:给出两个字符串,问最长的公共连续字串。

思路:后缀数组模板题。

具体可以参考两篇国集论文(09,04)
topcoder中的讲解
codechef上的讲解
还有一篇讲 dc3算法的论文:SuffixArrays_dc3
这里不谈具体的后缀数组的学习内容,说说大概的学习过程。

首先要理解后缀,后缀数组(sa[]),名次数组(rk[]),height数组,lcp 这些概念

先从定义入手,得到sa数组的n2logn的求法…

由于复杂度爆炸,所以有了两个算法来优化求sa的过程,一个是nlogn的倍增,还有一个是O(n)的dc3。。。

倍增的算法中用到了radix sort

上面这些,都是在说如何求sa,但是如果只有sa一个数组的话,就没有办法很好感受 后缀数组的power.

于是引入了height数组。

定义 height[i]=suffix(sa[i-1])和 suffix(sa[i])的最长公
共前缀,也就是排名相邻的两个后缀的最长公共前缀。那么对于 j 和 k,不妨设
rank[j]<rank[k],则有以下性质:
suffix(j) 和 suffix(k) 的 最 长 公 共 前 缀 为 height[rank[j]+1],
height[rank[j]+2], height[rank[j]+3], … ,height[rank[k]]中的最小值。

有很多问题都是基于height数组的,慢慢感受。

 

再说这道题:我们可以把两个字符串中间用一个特殊符号连接起来。

那么两个字符串的最长公共字串,就变成了求合并后的字符串的所有后缀的最长公共前缀。(原因是字符串的任何一个字串都可以看成是该字符串的某个后缀的前缀

那么容易知道,该最长公共前缀的长度一定是某个height值(原因是,height[i]表示的是排名相邻的两个后缀的最长公共前缀的长度,如果不相邻,那么取的是他们排名之间所有height的最小值,只会越来越小。

还需要注意的是,必须满足得到该height的两个后缀分别出现在原来的两个字符串中…

要怎么办到呢? 其实很容易,由于sa[i]数组存放的是排名第i的后缀是后缀几(定义从第x个字符开始的后缀就是后缀x)

设初始第一个字符串的长度为len1,那么如果是第一个字符串的后缀,一定有sa[i]<len1,如果是第二个字符串的后缀,就一定有sa[i]>len1  (sa[i]==len1的是插入的特殊符号开始的后缀)

 

还有一些细节可以参考代码注释

 

UD20160730:改正了lrj书中的错误。。对于rk[i]==0的情况进行了特判。。不然会re…

 

 

 

 

suffix array (转自 codechef)

原文链接:链接

讲了后缀数组的概念,然后从最暴力的O(n*n*logn )的复杂度(O(n)用来比较字符串,O(nlogn)是排序的复杂度)逐步优化,依据各个串之间的关系,大概讲了倍增算法,以及给出了一篇The Skew Algorithm 的论文。

文中实现的倍增算法的复杂度是O(Nlog2N)的。。是因为作者不会基数排序23333。

 

This text will focus on the construction of Suffix Arrays, it will aim to explain what they are and what they are used for and hopefully some examples will be provided (it will be mainly simple applications so that the concepts don’t get too attached to the theoretical explanation).

As usual, this follows my somewhat recent series of tutorials in order to make the reference post with links as complete as possible!

  • What is a Suffix Array?

In simple terms, a suffix array is just a sorted array of all the suffixes of a given string.

As a data structure, it is widely used in areas such as data compression, bioinformatics and, in general, in any area that deals with strings and string matching problems, so, as you can see, it is of great importance to know efficient algorithms to construct a suffix array for a given string.

Please note that on this context, the name suffix is the exact same thing as substring, as you can see from the wikipedia link provided.

A suffix array will contain integers that represent the starting indexes of the all the suffixes of a given string, after the aforementioned suffixes are sorted.

On some applications of suffix arrays, it is common to paddle the string with a special character (like #, @ or $) that is not present on the alphabet that is being used to represent the string and, as such, it’s considered to be smaller than all the other characters. (The reason why these special characters are used will hopefully be clearer ahead in this text)

And, as a picture it’s worth more than a thousand words, below is a small scheme which represents the several suffixes of a string (on the left) along with the suffix array for the same string (on the right). The original string is attcatg$.

The above picture describes what we want to do, and our goal with this text will be to explore different ways of doing this in the hope of obtaining a good solution.

We will enumerate some popular algorithms for this task and will actually implement some of them in C++ (as you will see, some of them are trivial to implement but can be too slow, while others have faster execution times at the cost of both implementation and memory complexity).

  • The naive algorithm

We shall begin our exploration of this very interesting topic by first studying the most naive algorithm available to solve our problem, which is also the most simple one to implement.

The key idea of the naive algorithm is using a good comparison-based sorting algorithm to sort all the suffixes of a given string in the fastest possible way. Quick-sort does this task very well.

However we should remind ourselves that we are sorting strings, so, either we use the overloaded < sign to serve as a “comparator” for strings (this is done internally in C++ for the string data type) or we write our own string comparison function, which is basically the same thing regarding time complexity, with the former alternative consuming us more time on the writing of code. As such, on my own implementation I chose to keep things simple and used the built-in sort() function applied to a vector of strings. As to compare two strings, we are forced to iterate over all its characters, the time complexity to compare strings is O(N), which means that:

On the naive approach, we are sorting N strings with an O(N log N) comparison based sorting algorithm. As comparing strings takes O(N) time, we can conclude that the time complexity of our naive approach is O(N2 log N)

After sorting all the strings, we need to be able to “retrieve” the original index that each string had initially so we can actually build the suffix array itself.

[Sidenote: As written on the image, the indexes just “come along for the ride”.

To do this, I simply used a map as an auxiliary data structure, such that the keys are the strings that will map to the values which are the original indexes the strings had on the original array. Now, retrieving these values is trivial.]

Below, you can find the code for the naive algorithm for constructing the Suffix Array of a given string entered by the user as input:

As you can see by the above code snippet, the implementation of the naive approach is pretty straightforward and very robust as little to virtually no space for errors is allowed if one uses built-in sorting functions.

However, such simplicity comes with an associated cost, and on this case, such cost is paid with a relatively high time complexity which is actually impractical for most problems. So, we need to tune up this approach a bit and attempt to devise a better algorithm.

This is what will be done on the next section.

  • A clever approach of building the Suffix Array of a given string

As noted above, Suffix Array construction is simple, but an efficient Suffix Array construction is hard.

However, after some thinking we can actually have a very defined idea of why we are performing so badly on such construction.

The reason why we are doing badly on the construction of the SA is because we are NOT EXPLOITING the fact that the strings we are sorting, are actually all part of the SAME original string, and not random, unrelated strings.

However, how can this observation help us?

This observation can help us greatly because now we can actually use tuples that contain only some characters of the string (which we will group in powers of two) such that we can sort the strings in a more ordered fashion by their first two characters, then we can improve on and sort them by their first four characters and so on, until we have reached a length such that we can be sure all the strings are themselves sorted.

With this observation at hand, we can actually cut down the execution time of our SA construction algorithm from O(N2 log N) to O(N log2 N).

Using the amazing work done by @gamabunta, I can provide his explanation of this approach, along with his pseudo-code and later improve a little bit upon it by actually providing an actual C++ implementation of this idea:

@gamabunta‘s work

Let us consider the original array or suffixes, sorted only according to the first 2 character. If the first 2 character is the same, we consider that the strings have the same sort index.

Now, we wish to use the above array, and sort the suffixes according to their first 4 characters. To achieve this, we can assign 2-tuples to each string. The first value in the 2-tuple is the sort-index of the respective suffix, from above. The second value in the 2-tuple is the sort-indexof the suffix that starts 2 positions later, again from above.

If the length of the suffix is less than 2 characters, then we can keep the second value in the 2-tuple as -1.

Now, we can call quick-sort and sort the suffixes according to their first 4 characters by using the 2-tuples we constructed above! The result would be:

Similarly constructing the 2-tuples and performing quick-sort again will give us suffixes sorted by their first 8 characters.

Thus, we can sort the suffixes by the following pseudo-code

The above algorithm will find the Suffix Array in O(N log2 N).

end of @gamabunta‘s work

Below you can find a C++ implementation of the above pseudo-code:

 

 

This concludes the explanation of a more efficient approach on building the suffix array for a given string. The runtime is, as said above, O(N log2N).

  • Constructing (and explaining) the LCP array

The LCP array (Longest Common Prefix) is an auxiliary data structure to the suffix array. It stores the lengths of the longest common prefixes between pairs of consecutive suffixes in the suffix array.

So, if one has built the Suffix Array, it’s relatively simple to actually build the LCP array.

In fact, using once again @gamabunta‘s amazing work, below there is the pseudo-code which allows one to efficiently find the LCP array:

We can use the SortIndex array we constructed above to find the Longest Common Prefix, between any two prefixes.

The LCP Array is the array of Longest Common Prefixes between the ith suffix and the (i-1)th suffix in the Suffix Array. The above algorithm needs to be called N times to build the LCP Array in a total of O(N log N) time.

  • Moving on from here

This post was actually the first long post I wrote about a subject which I’m not familiar with, AT ALL. This is always a risk I am also taking, but I tried to adhere only to the sub-topics I considered I mastered relatively well myself (at least, in theory, as I still don’t think I could implement this correctly on a live contest or even a pratice problem… But, as I said many times, I’m here to work as hard as I can to learn as much as I can!)

I hope that what I wrote is, at least, decent and I did it basically as a good way of gathering information which is very spread over many papers and websites online, so that when people read this post they will be able to grasp the ideas for the naive solution as well as for the improvement presented as a better solution.

There are many interesting linear algorithms to “attack” this problem, with one of the most famous being the Skew Algorithm, which is lovely described on the link I provide here.

Besides this, there are several other algorithms which are also linear that exploit the relationship between Suffix Trees and the Suffix Array and that use linear sorting algorithm like radix sort, but, which I sadly don’t yet understand which makes me unable to discuss them here.

However, I hope this little text does its job by at least gathering some useful information on a single post 🙂

I am also learning as I write these texts and this has helped me a lot on my evolution as a coder and I hope I can keep contributing to give all my best to this incredible cmmunity 😀

Best regards,

Bruno Oliveira